FINAL REPORT 

Transient Faults in Computer Systems 
NASA Grant No. NSG-1442 


Gerald M. Masson 
Principal Investigator 


Department of Computer Science 
The Johns Hopkins University 
Baltimore, Maryland 21218-2694 
Phone: (410) 516-7013 
FAX: (410) 516-6134 
Email: masson@cs.jhu.edu


Summary 

With support from NASA Grant NSG-1442, we have developed a novel and powerful
technique particularly appropriate for the detection of errors caused by transient faults in computer
systems. The technique can be implemented in either software or hardware; the research conducted 
thus far primarily has considered software implementations. The error detection technique we have 
developed has the distinct advantage of having provably complete coverage of all errors caused by 
transient faults that affect the output produced by the execution of a program. In other words, the 
technique does not have to be tuned to a particular error model to enhance error coverage. Also, 
the correctness of the technique can be formally verified.

When implemented in software, this new technique uses time and software redundancy and can 
be outlined as follows. In the initial phase, a program is run to solve a problem and store the 
result. In addition, this program leaves behind a trail of data which we call a certification trail. In
the second phase, another program is run which solves the original problem again. This program, 
however, has access to the certification trail left by the first program. Because of the availability 
of the certification trail, the second phase can be performed by a less complex program and can 
execute more quickly. In the final phase, the two results are compared and if they agree the results 
are accepted as correct; otherwise an error is indicated. An essential aspect of this approach is 
that the second program must always generate either an error indication or a correct output even 
when the certification trail it receives from the first program is incorrect. We have formalized the 
certification trail approach to fault tolerance and have illustrated numerous realizations of it for
well-known and important problems. We have rigorously proven the correctness of the technique
for certain applications. We have shown cases in which the second phase can be run concurrently 
with the first and act as a real-time monitor. We have compared the certification trail approach 
to other approaches to error detection to demonstrate the significant conceptual and performance 
advantages. 

This research has developed the foundation for an effective, low-overhead, software-based
certification trail approach to real-time detection of errors resulting from transient fault phenomena.
It would be particularly appropriate at this time to examine the technique further in the context
of important and timely applications. For example, transient error phenomena caused by ionizing
radiation in space or high-altitude avionics environments stand as a major obstacle to many
applications of high performance microelectronics. The research reported in the following would
provide a framework for the development of “radiation-hardened software” that would permit the
utilization of high performance microelectronics in space and high-altitude avionics applications in
an efficient and cost effective manner.

In the following, seven papers are provided which together characterize the current state of the 
most recent research conducted with support from NASA Grant NSG-1442: 

1. Certification of Computational Results, Gregory F. Sullivan, Dwight S. Wilson, Gerald M.
Masson.

2. Experimental Evaluation of the Certification-Trail Method, Gregory F. Sullivan, Dwight S.
Wilson, Gerald M. Masson, Mamoru Itoh, Warren W. Smith, Jonathan S. Kay.

3. Certification Trails and Software Design for Testability, Gregory F. Sullivan, Dwight S.
Wilson, Gerald M. Masson.

4. Experimental Evaluation of Certification Trails using Abstract Data Type Validation, Dwight
S. Wilson, Gregory F. Sullivan, Gerald M. Masson.

5. United States Patent, Method and Apparatus for Fault Tolerance, Patent No. 5,243,607, Sept.
7, 1993, United States Patent Office.

6. Using Certification Trails to Achieve Software Fault Tolerance, Gregory F. Sullivan, Gerald
M. Masson.

7. Certification Trails for Data Structures, Gregory F. Sullivan, Gerald M. Masson.





Figure 1: Certification trail method. 


the software in addition to those caused by transient hardware faults and utilizes both time and 
software redundancy. Errors caused by software faults are detected whenever the independently 
written programs do not generate coincident errors. 

A significant drawback to the above approaches is the overhead required. Either extra time 
is required to run the algorithms serially on a single processor or extra hardware is required to 
run them in parallel. The technique we will describe is designed to achieve similar types of error 
detection capabilities while reducing the required resource overhead. The central idea, as illustrated 
in Figure 1, is to modify the first algorithm so that it leaves behind a trail of data which we call a 
certification trail. This data is chosen to allow the second algorithm to execute more quickly and/or 
have a simpler structure than the first algorithm. As above, the outputs of the two executions are 
compared and are considered correct only if they agree. Note, however, that we must be careful in 
defining this method or else its error detection capability might be reduced by the introduction of 
data dependency between the two algorithm executions. For example, suppose the first algorithm 
execution contains an error which causes an incorrect output and an incorrect trail of data to be
generated. Further suppose that no error occurs during the execution of the second algorithm. It 
appears possible that the execution of the second algorithm might use the incorrect trail to generate 
an incorrect output which matches the incorrect output produced by the first algorithm. Intuitively, 
we can regard the two executions as “adversaries.” The second execution must guard against an 
incorrect certification trail “fooling” it into producing an incorrect output. The definitions we give 
below exclude this possibility. They demand that the second execution either generates a correct 
answer or signals the fact that an error has been detected in the certification trail.

2 Formal Definition of a Certification Trail

In this section we will give a formal definition of a certification trail and discuss some aspects of 
its realizations and uses. 

Definition 2.1 A problem P is formalized as a relation, i.e., a set of ordered pairs. Let D be the
domain (that is, the set of inputs) of the relation P and let S be the range (that is, the set of
solutions) for the problem. We say an algorithm A solves a problem P iff for all $d \in D$, when d is
input to A then an $s \in S$ is output such that $(d, s) \in P$.







Definition 2.2 Let $P : D \to S$ be a problem. A solution to this problem using a certification
trail consists of two functions $F_1$ and $F_2$ with the following domains and ranges: $F_1 : D \to S \times T$
and $F_2 : D \times T \to S \cup \{\text{error}\}$. T is the set of certification trails. The functions must satisfy the
following two properties:

(1) for all $d \in D$ there exists $s \in S$ and there exists $t \in T$ such that

$F_1(d) = (s,t)$ and $F_2(d,t) = s$ and $(d,s) \in P$

(2) for all $d \in D$ and for all $t \in T$

either ($F_2(d,t) = s$ and $(d,s) \in P$) or $F_2(d,t) = \text{error}$.


We also require that $F_1$ and $F_2$ be implemented so that they map elements not in their respective
domains to the error symbol. The definitions above assure that the error detection capability of
the certification trail approach is comparable to that obtained with the simple time redundancy
approach discussed earlier. (That is, if transient hardware faults occur during only one of the
executions then either an error will be detected or the output will be correct.) It should be further
noted, however, that the examples to be considered will indicate that this approach can also save
overall execution time.
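As a hedged illustration of Definition 2.2, the following Python sketch wires the two phases and the final comparison together for a deliberately tiny problem (finding the maximum of a non-empty list). The names `first_max`, `second_max`, `run_with_certification`, and the `ERROR` marker are assumptions made for this example only; the trail here is simply the index of the maximum, and the sketch is not taken from the implementations studied in this report.

```python
ERROR = "error"  # stand-in for the distinguished error symbol (an assumption)


def first_max(d):
    """Phase 1 (plays the role of F_1): return (solution, trail).

    The trail is the index at which the maximum was found.
    Inputs outside the domain (empty lists) map to the error symbol.
    """
    if not d:
        return ERROR, None
    i = max(range(len(d)), key=lambda j: d[j])
    return d[i], i


def second_max(d, t):
    """Phase 2 (plays the role of F_2): use the trail, but never trust it.

    Returns a correct maximum or ERROR for every possible trail t,
    which is property (2) of Definition 2.2.
    """
    if not d or not isinstance(t, int) or not 0 <= t < len(d):
        return ERROR
    candidate = d[t]
    if any(x > candidate for x in d):  # the trail pointed at a non-maximum
        return ERROR
    return candidate


def run_with_certification(first, second, d):
    """Run both phases and accept the result only if the outputs agree."""
    s1, trail = first(d)
    s2 = second(d, trail)
    if s1 == ERROR or s2 == ERROR or s1 != s2:
        return ERROR
    return s1
```

In this toy case the second phase is no faster than the first; the point is only that a corrupted trail (for example an index that is out of range or that points at a smaller element) yields `ERROR` rather than a wrong answer.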

The certification trail approach also allows for the detection of faults in software. As in
2-version programming, separate teams can write the algorithms for the first and second executions.
Note that the specification now must include precise information describing the generation and
use of the certification trail. Because of the additional data available to the second execution,
the specifications of the two phases can be very different; similarly, the two algorithms used to
implement the phases can be very different. (This will be illustrated in the convex hull example to
be considered later.) Alternatively, the two algorithms can be very similar, differing only in data 
structure manipulations. (This will be illustrated in the shortest path example to be considered 
later.) When significantly different algorithms are used, the probability that both algorithms will 
contain or be affected by faults which generate matching errors should be reduced. When very 
similar algorithms are used it is sometimes possible to save programming effort by sharing program 
code. For example, the code implementing any data structures needed by the program might be 
different, while the code that uses the data structure operations would be the same. This approach 
is well suited for the creation of libraries of fault-tolerant data structures. While this reduces the 
ability to detect errors in the software it does not change the ability to detect transient hardware 
errors as discussed earlier. Furthermore, in situations like the above example, it is possible (perhaps 
even probable) that the majority of software errors will be in the data structure implementation. 
Thus the ability to detect software errors may not be reduced as much as first imagined. 

Throughout this section we have assumed that our method is implemented with software; however,
it is clearly possible to implement the method with assistance from dedicated hardware. It
is also possible to generalize the basic idea we have suggested. We discuss some of these
generalizations in a later section. Finally, we note that a wide variety of approaches to software fault
tolerance have been proposed and we contrast our method to the most closely related ideas in a
later section.

In the following three sections we illustrate the application of certification trails to three
well-known and significant problems in computer science: the convex hull problem, sorting, and the
shortest path problem. It should be stressed that the certification trail approach is not limited to these
problems. Rather, these algorithms have been selected for illustrative purposes.




3 Certification Trails for Convex Hulls 


The convex hull problem is a fundamental one in computational geometry. Our certification trail 
solution is based on a solution due to Graham [13] called Graham’s Scan. For basic definitions in 
computational geometry see the text of Preparata and Shamos [20]. This text also illustrates some 
statistical applications of convex hull computations. For simplicity in the following discussion we 
will assume the points are in so called general position, i.e., no three points are co-linear. It is not 
difficult to remove this restriction. 

Definition 3.1 The convex hull of a set of points, S, in the Euclidean plane is defined as the
smallest convex polygon enclosing all the points. This polygon is unique and its vertices are a
subset of the points in S. It is specified by a counterclockwise sequence of its vertices.

The algorithm given below constructs the convex hull incrementally in a counterclockwise fashion.
Sometimes it is necessary for the algorithm to “back up” the construction by throwing some
vertices out and then continuing. The first step of the algorithm selects the point with minimum
x-coordinate (using minimum y-coordinate to break ties), and calls it $p_1$. For each other point q
in S we compute the slope of the line $p_1 q$. We then sort the points of S (except for $p_1$) by this
slope (since the points are in general position, the slopes are distinct) and number these vertices
$p_2, p_3, \ldots, p_n$.

It is not hard to show that after these three steps the points, when taken in order $p_1, p_2, \ldots, p_n$,
form a simple polygon, although this polygon might not be convex. It is possible to think of the
algorithm as removing points from this simple polygon until it becomes convex. The code below
performs this by “walking” through the vertices in order. The main FOR loop iteration adds points
to the polygon under construction. After a point is added, the inner WHILE loop checks the angle
formed by the addition of this point. (Note: We measure angles as follows: Given the three points
$q_{m-1}, q_m, p_k$ we measure the angle from $q_{m-1} q_m$ to $q_m p_k$ in the clockwise direction. In this
discussion “acute” is used to mean less than 180 degrees and “obtuse” to mean greater than 180 degrees.)
If the angle is not acute (i.e., it makes the polygon non-convex), then the angle vertex (i.e., the preceding
point on the polygon) is removed. Note that this will change the preceding angle, which may
now be obtuse and should be eliminated. The WHILE loop terminates when an acute angle is
encountered. Figure 2 illustrates the construction of a convex hull using this algorithm.

When the main FOR loop is complete the convex hull has been constructed. 

Algorithm CONVEXHULL(S)

Input: Set of points, S, in $R^2$

Output: Counterclockwise sequence of points in $R^2$ which define the convex hull of S

    Let p_1 be the point with the smallest x coordinate (and smallest y to break ties)
    For each point p (except p_1) calculate the slope of the line through p_1 and p
    Sort the points (except p_1) from the smallest slope to the largest.
        Call them p_2, ..., p_n
    q_1 := p_1; q_2 := p_2; q_3 := p_3; m := 3
    FOR k = 4 to n DO
        WHILE the angle formed by q_{m-1}, q_m, p_k is > 180 degrees DO
            m := m - 1
        END WHILE
        m := m + 1
        q_m := p_k
    END FOR
    FOR i = 1 to m DO, OUTPUT(q_i) END FOR

END CONVEXHULL

Figure 2: Convex hull example.

First execution: To generate a certification trail for this algorithm, we rely on the property
that for each point eliminated by the WHILE loop in the code above, we can produce a triangle of
points in S containing the eliminated point.

Theorem 3.2 Let p, a, b, and c be points in the plane such that no three are co-linear, p has the
smallest x-coordinate of the four points (and the smaller y-coordinate if another point has the
same x-coordinate), and slope(pa) < slope(pb) < slope(pc). If the angle abc is obtuse (measured in the
clockwise direction), then b is inside the triangle pac.

Proof: By the ordering of the slopes, b is inside the triangular wedge determined by the rays pa
and pc. Note that the line segments pa and pc lie in the half plane of points whose x-coordinate is
at least that of p, and in at least one case the inequality is strict, since no three points are co-linear.
This implies that the angle apc (in the clockwise direction) must be greater than 180 degrees. Since
the angle abc is also obtuse, both p and b must be on the same side of line ac. Therefore, b is inside
the triangle pac. ∎

Corollary 3.3 During execution of CONVEXHULL, if, after adding $p_k$, the angle formed by
$q_{m-1}, q_m, p_k$ is obtuse (measured in the clockwise direction), then $q_m$ is contained in the triangle
$p_1, q_{m-1}, p_k$.

Proof: $\text{slope}(p_1 q_{m-1}) < \text{slope}(p_1 q_m) < \text{slope}(p_1 p_k)$. ∎



In the first execution the code CONVEXHULL is used. The certification trail is generated by
adding an output statement within the WHILE loop. Specifically, if an angle greater than 180
degrees is found in the WHILE loop test then the 4-tuple consisting of $q_m$ and the triangle
vertices $p_1$, $q_{m-1}$, $p_k$ of Corollary 3.3 is output to
the certification trail. The table below shows the 4-tuples of points that would be output by the
algorithm when run on the example in Figure 2. The points in the table are given the same names
as in Figure 2. The final convex hull points are also output to the certification trail.

Finally, the trail output does not consist of the actual points in $R^2$. Instead, it consists of indices
to the original input data. This means that if the original data consists of $s_1, s_2, \ldots, s_n$, then rather
than output the element of $R^2$ corresponding to $s_i$, the number i is output. If point coordinates were
output instead of these indices, the second execution would have to verify that the points on the
trail are members of S.

Point not on convex hull     Three surrounding points

$p_3$                        $p_4, p_1, p_2$
$p_5$                        $p_6, p_1, p_4$
$p_7$                        $p_8, p_1, p_6$
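The first execution described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the code evaluated in this report: the function names are hypothetical, indices are 0-based rather than 1-based, the slope ordering is obtained with `math.atan2` (which is monotone in the slope because no point lies to the left of $p_1$), the input is assumed to contain at least three points in general position, and each trail entry is stored in the order (eliminated point, $p_1$, $q_{m-1}$, $p_k$).

```python
import math


def cross(o, a, b):
    """Twice the signed area of triangle o, a, b; positive for a left turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull_with_trail(points):
    """Graham scan over indices into `points`, emitting a certification trail.

    Returns (hull, trail). `hull` is the counterclockwise list of hull
    indices; `trail` holds one 4-tuple of indices per eliminated point.
    """
    n = len(points)
    # Step 1: smallest x, ties broken by smallest y (tuple comparison).
    first = min(range(n), key=lambda i: points[i])
    fx, fy = points[first]
    # Steps 2 and 3: order the remaining points by slope as seen from p_1.
    rest = sorted(
        (i for i in range(n) if i != first),
        key=lambda i: math.atan2(points[i][1] - fy, points[i][0] - fx),
    )
    order = [first] + rest
    hull = order[:3]  # q_1, q_2, q_3
    trail = []
    for k in order[3:]:
        # A non-left turn at the top of the stack means the angle is > 180
        # degrees in the clockwise convention used in the text.
        while len(hull) > 2 and cross(points[hull[-2]], points[hull[-1]], points[k]) <= 0:
            removed = hull.pop()
            trail.append((removed, first, hull[-1], k))
        hull.append(k)
    return hull, trail
```

For the six points `[(0, 0), (4, 0), (4, 4), (0, 4), (2, 1), (1, 2)]` the sketch returns the square's four corners as the hull and two trail entries, one for each interior point.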


Second execution: Let the certification trail consist of a set of 4-tuples,
$(x_1, a_1, b_1, c_1), \ldots, (x_r, a_r, b_r, c_r)$, followed by the supposed convex hull,
$q_1, q_2, \ldots, q_m$. The code for CONVEXHULL is not used in this execution. Indeed, the
algorithm performed is dramatically different from CONVEXHULL.

It consists of five checks on the trail data. 

i. That there is a one to one correspondence between the input points and the points in
$\{x_1, \ldots, x_r\} \cup \{q_1, \ldots, q_m\}$.

ii. That for $i \in \{1, \ldots, r\}$, $a_i$, $b_i$, and $c_i$ are among the input points.

iii. For $i \in \{1, \ldots, r\}$ that $x_i$ lies within the triangle defined by $a_i$, $b_i$, and $c_i$.

iv. That for each triple of counterclockwise consecutive points on the supposed convex hull the
angle formed by the points is acute.

v. That there is a unique point among the points on the supposed convex hull which is a locally
maximal point. We say a point q on the hull is a local maximum point if its predecessor in the
counterclockwise ordering has a strictly smaller y coordinate and its successor in the ordering
has a smaller or equal y coordinate.

If any of these checks fail then execution halts and “error” is output. As mentioned above, the
trail data actually consists of indices into the input data. This does not unduly complicate the
checks above; in fact it makes it easier to verify the first and second conditions.
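The five checks can be sketched as a single linear pass in Python. This is an illustrative sketch only: `check_hull`, `strictly_inside`, and the `ERROR` marker are hypothetical names, indices are 0-based, the 4-tuples and the supposed hull are passed as two separate arguments, and the trail is assumed to be well-formed in the sense that every entry is a 4-tuple.

```python
ERROR = "error"  # stand-in for the error symbol (an assumption)


def cross(o, a, b):
    """Twice the signed area of triangle o, a, b; positive for a left turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def strictly_inside(p, a, b, c):
    """True if p lies strictly inside triangle a, b, c (either orientation)."""
    d1, d2, d3 = cross(a, b, p), cross(b, c, p), cross(c, a, p)
    return (d1 > 0 and d2 > 0 and d3 > 0) or (d1 < 0 and d2 < 0 and d3 < 0)


def check_hull(points, trail, hull):
    """Second execution: return the hull (as indices) or ERROR."""
    n, m = len(points), len(hull)
    if m < 3:
        return ERROR

    def valid(i):
        return isinstance(i, int) and 0 <= i < n

    # Check i: every input index is used exactly once, either as an
    # eliminated point or as a hull vertex.
    seen = [False] * n
    for i in [x for (x, _, _, _) in trail] + list(hull):
        if not valid(i) or seen[i]:
            return ERROR
        seen[i] = True
    if not all(seen):
        return ERROR
    for x, a, b, c in trail:
        # Check ii: the triangle vertices are input points.
        if not (valid(a) and valid(b) and valid(c)):
            return ERROR
        # Check iii: the eliminated point lies inside its triangle.
        if not strictly_inside(points[x], points[a], points[b], points[c]):
            return ERROR
    maxima = 0
    for j in range(m):
        prev, cur, nxt = points[hull[j - 1]], points[hull[j]], points[hull[(j + 1) % m]]
        # Check iv: every consecutive triple makes a left turn.
        if cross(prev, cur, nxt) <= 0:
            return ERROR
        # Check v: count local maxima as defined in the text.
        if prev[1] < cur[1] and nxt[1] <= cur[1]:
            maxima += 1
    if maxima != 1:
        return ERROR
    return list(hull)
```

Each loop touches every trail entry or hull vertex once and does constant work per item, which matches the linear-time claim made for the second execution.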

Time complexity: In the first execution the sorting of the input points takes $O(n \log n)$ time
where n is the number of input points. One can show that this cost dominates and the overall
complexity is $O(n \log n)$.

It is possible to implement the second execution so that all five checks are done in $O(n)$ time.
Because indices into the input data are used, the first condition can be checked by verifying that
each index is used exactly once, and that all indices are between 1 and n. The second condition
may be checked simply by verifying that each index is between 1 and n. Checking that a point lies
within a triangle is a geometric calculation that can be done in constant time. Checking that the
angle formed by three points is acute requires only constant time. The third and fourth checks can
be done in $O(n)$ because the certification trail contains indices into the input data as described
above. The uniqueness of the “local maximum” requires only a constant time calculation at each
point, so it may be checked in linear time.

Experimental timing data for this method may be found in Section 6. 

3.1 Proof of correctness 

We wish to prove that the algorithms above constitute a certification trail solution for the convex 
hull problem. Although the definition is phrased in terms of functions, not algorithms, we can 
simply define the functions $F_1(d)$ and $F_2(d,t)$ on particular arguments as the values computed by
the associated algorithms. 

Using our formal definition of certification trails, let D be the set of all finite planar point sets 
T. Let S be the set of convex polygons, with vertices in counterclockwise order (the restriction to 
counterclockwise ordering makes the convex hull unique). Then the problem we are considering is 
$\text{HULL} : D \to S$ where $\text{HULL}(T)$ is the polygon in S that forms the convex hull of T.

The description of the algorithms above defines functions $F_1$ and $F_2$. We must show that both
conditions of Definition 2.2 hold. The following two lemmas, which we state without proof, are
required.

Lemma 3.4 Let P be a polygon on n points $p_1, p_2, \ldots, p_n$. P is a convex polygon iff P is simple
and each angle $p_i p_j p_k$ is less than or equal to 180 degrees, where i is in $1, 2, \ldots, n$, $j = (i + 1) \bmod n$,
and $k = (i + 2) \bmod n$.

Lemma 3.5 If P is a non-simple polygon, then either P has more than one local maximum, or the
interior angle at some vertex is greater than 180 degrees.

Theorem 3.6 $F_1(d)$ and $F_2(d,t)$, as defined above, constitute a certification trail solution for the
problem HULL.

Proof: We must prove that both conditions of Definition 2.2 are satisfied by these functions. 

Part 1: Recall that the first condition is: for all $d \in D$ there exists $s \in S$ and $t \in T$ such
that $F_1(d) = (s,t)$ and $F_2(d,t) = s$ and $(d,s) \in P$. Intuitively, this means that if both executions
perform correctly, then they will both output the convex hull of the input, which is unique. Note
that generation of the certification trail does not affect the output of the Graham Scan algorithm.
Thus the condition on $F_1(d)$ is satisfied by the correctness of the Graham Scan algorithm, the proof
of which is well known [20]. To show that $F_2(d,t) = s$, note that a copy of s is contained on the
trail t. Our description of $F_2(d,t)$ states that s is output unless one of the five checks above fails.
It is trivial to verify that the first three of these checks must be satisfied. The fourth check cannot
fail, since the polygon described by s is convex (because $(d,s) \in P$). Similarly, if the fifth check
fails, then the polygon described by s has two local maxima, and this is not possible for a convex
polygon.

Part 2: The second condition is: for all $d \in D$ and all $t \in T$ either ($F_2(d,t) = s$ and $(d,s) \in P$) or
$F_2(d,t) = \text{error}$. Intuitively, this means that given an input and arbitrary trail, $F_2(d,t)$ produces a
solution to the problem or flags an error. Our definition of $F_2(d,t)$ states that the polygon Q stored
on the trail is output unless one of the five checks fails. We must therefore demonstrate that if all
five checks succeed, then Q is the convex hull of the input points d. Let H be the convex hull of
the points d. The first condition guarantees that every point in d is classified as a hull point or an
interior point. The second condition guarantees that the triangles used to identify interior points
are formed from input points, and the third check verifies that the interior points are indeed inside
their respective triangles. Note that we do not attempt to verify that the triangles on the trail are
the ones that would be produced by $F_1(d)$. In general, for a given interior point, there may be
several triangles of input points in which it is contained. Together, the first three conditions imply
that all points in H are also in Q, since it is impossible for a hull point to be contained in a triangle.
Note that these three checks do not exclude the possibility that interior points are present in Q,
nor do they guarantee that the ordering of the hull points in Q is correct. The final two checks
will accomplish this. If the last two checks are satisfied, Lemma 3.5 states that Q is simple, and
therefore it must be convex by Lemma 3.4.

Thus, Q is a convex polygon whose vertex set is a superset of the vertices of H, i.e., H is
contained in Q. This implies that no other point from the input set may be a vertex of Q, since any
input point that is not a hull point is interior to H and therefore interior to Q. Finally, it is clear
that the ordering of the vertices of Q and H must be the same (although there might appear to
be two possible orderings, clockwise and counterclockwise, a clockwise ordering will fail the fourth
check). Therefore if all five checks succeed, then the output of $F_2(d,t)$ will be the convex hull of d.

This demonstrates that the algorithms described meet the conditions of Definition 2.2, and are
therefore a certification trail solution to the convex hull problem. ∎

3.2 Other convex hull algorithms 

It is possible to use this technique to provide certification trails for other convex hull algorithms.
The key is that for each non-hull point p we must find a triangle of input points (not necessarily hull
points) containing p. For some convex hull algorithms, a containing triangle is available directly or
can be easily computed when it is determined that a particular point is not on the hull. However,
this is not true of all convex hull algorithms. If, however, we allow extra overhead during the first
execution we may apply this technique to any planar convex hull algorithm, provided that the
output is a polygon and not merely an unordered list of hull vertices.

Let $H = q_1, q_2, q_3, \ldots, q_h$ be the convex hull of a set of n points. We label the points so that $q_1$ is
the point with smallest abscissa (and smallest ordinate in case of a tie). Since H is convex, the
remaining points occur in sorted angular order around $q_1$. Now for each non-hull point p, we may
determine which triangle $q_1, q_i, q_{i+1}$ it lies in with a binary search. Thus we may determine containing
triangles for the non-hull points in $O(n \log h)$ time. Under several distributions the number of hull
points is much smaller than the number of input points [20] so this overhead will often be quite
small.
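The binary search described above can be sketched as follows. This is a hedged illustration rather than code from the report: `containing_fan_triangle` is a hypothetical name, the hull is passed as a counterclockwise list of coordinates whose first entry plays the role of $q_1$, positions are 0-based, and the query point is assumed to lie strictly inside the hull with all points in general position.

```python
def cross(o, a, b):
    """Twice the signed area of triangle o, a, b; positive for a left turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def containing_fan_triangle(hull, p):
    """Locate the fan triangle (hull[0], hull[i], hull[i + 1]) containing p.

    Returns the three positions in `hull`. Uses O(log h) cross products
    because the hull vertices are in angular order around hull[0].
    """
    lo, hi = 1, len(hull) - 1
    # Invariant: p is on or to the left of the ray hull[0] -> hull[lo]
    # and to the right of the ray hull[0] -> hull[hi].
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if cross(hull[0], hull[mid], p) >= 0:
            lo = mid
        else:
            hi = mid
    return 0, lo, hi
```

Running this once per non-hull point gives the $O(n \log h)$ bound stated above, and the resulting triples can be written to the trail in place of the triangles that Graham's Scan would have produced.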

4 Sorting

Sorting is one of the most important basic problems in computer science. There is a massive body
of literature discussing sorting and a significant fraction of computer time is spent performing sort
operations. We will see how the certification trail approach may be applied to this problem. Assume
that a particular sorting algorithm takes as input an array of n elements and outputs an array of
n elements. The algorithm is supposed to place the data into non-decreasing order.

Note that it may not appear necessary to use a certification trail for this problem. It might seem
that all that is required is to verify that the output is in non-decreasing order. Unfortunately, this
is not sufficient and we must also verify that the output consists of the same elements as the input.
A certification trail is required to perform this check efficiently.


The information placed on the trail is a permutation relating the input and output arrays. This 
permutation is created by adding an Item Number field to the elements being sorted, such that the 
i-th element is labelled with item number i. After sorting, the permutation is obtained by reading
the Item Numbers from the elements in their new order. 

The second algorithm reads the permutation from the trail, uses it to rearrange the input elements
in linear time, and checks that they are now in sorted order. Additionally, it is necessary to check
that the information on the certification trail actually is a permutation of n elements, i.e., each
number from 1 to n occurs exactly once. Should any of these checks fail, the second algorithm
outputs “error”; otherwise it outputs the sorted elements.

Note that the certification trail given for sorting is quite different than that given for the convex 
hull problem. In the latter case, the certification trail was constructed for a particular algorithm, 
and the code executing that algorithm modified to produce the trail. In this case, the sorting 
algorithm is not changed. Instead the data being sorted is modified by a preprocessing step, and the 
necessary information extracted by a postprocessing step. Thus this technique may be implemented 
as a “wrapper” around existing sort routines, no matter which algorithm is implemented. 
Experimental data is presented in Section 6. 
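The wrapper idea can be sketched in a few lines of Python. This is a minimal sketch under assumptions, not the code used for the experiments: `sort_with_trail`, `check_sort`, and the `ERROR` marker are hypothetical names, item numbers are 0-based rather than running from 1 to n, and Python's built-in `sorted` stands in for whatever existing sort routine is being wrapped.

```python
ERROR = "error"  # stand-in for the error symbol (an assumption)


def sort_with_trail(d):
    """First execution: tag each element with its item number, sort, and
    read the item numbers back off as the certification trail."""
    tagged = sorted(enumerate(d), key=lambda pair: pair[1])
    s = [value for _, value in tagged]
    t = [item for item, _ in tagged]
    return s, t


def check_sort(d, t):
    """Second execution: linear-time use of the trail, trusting nothing."""
    n = len(d)
    if len(t) < n:
        return ERROR
    # The first n trail entries must form a permutation of 0..n-1.
    seen = [False] * n
    for i in t[:n]:
        if not isinstance(i, int) or not 0 <= i < n or seen[i]:
            return ERROR
        seen[i] = True
    # Rearrange the input according to the trail and verify the order.
    out = [d[i] for i in t[:n]]
    if any(out[j] > out[j + 1] for j in range(n - 1)):
        return ERROR
    return out
```

Because the permutation check guarantees that the output is a rearrangement of the input, the only remaining way to fail is an out-of-order pair, which the final scan detects.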

4.1 Proof of correctness 

For concreteness we consider only the sorting of integers, though the proof does not depend on this 
condition. 

Definition 4.1 Let D consist of all finite sequences of integers. Let S consist of all finite
non-decreasing sequences of integers. Let $P : D \to S$ be the sorting problem, i.e., $(d,s) \in P$ iff s is a
permutation of d (by definition of S, s is a non-decreasing sequence). Note that for every $d \in D$,
there is a unique $s \in S$ such that $(d,s) \in P$. Let T consist of finite sequences of integers. For x a
member of any of the sets D, S, or T, we will also denote the sequence of integers by $x_1, x_2, \ldots$

Definition 4.2 The function $F_1 : D \to S \times T$ is defined as follows. Given an input sequence d
of N integers, $F_1(d) = (s,t)$ where s is the unique element of S such that $(d,s) \in P$ and t is a
permutation of $1, 2, 3, \ldots, N$ such that $s_i = d_{t_i}$ for all $i = 1, 2, \ldots, N$. Note that unless d consists of N distinct
integers, there will be more than one possible t. The t produced by $F_1(d)$ may be chosen arbitrarily.
Since for every $d \in D$ there exists a unique $s \in S$ with $(d,s) \in P$, the function $F_1$ is well defined.

Definition 4.3 The function $F_2 : D \times T \to S \cup \{\text{error}\}$ is defined as follows.
$F_2(d,t) = d_{t_1}, d_{t_2}, \ldots, d_{t_N}$ (where d consists of N integers) iff

i. t contains at least N integers.

ii. The first N integers of t are a permutation of $\{1, 2, \ldots, N\}$.

iii. $d_{t_i} \le d_{t_{i+1}}$ for $i = 1, 2, \ldots, N - 1$.

Otherwise, $F_2(d,t) = \text{error}$. Note that though t may contain more than N integers, $F_2(d,t)$
depends only on the first N.

The definitions of the functions $F_1$ and $F_2$ correspond to the informal descriptions of the
sorting algorithms given in the text above.

Theorem 4.4 $F_1$ and $F_2$ are a certification trail solution to the sorting problem P.

Proof: We must prove that both conditions of Definition 2.2 are satisfied by these functions.

Part 1: We must prove that for all $d \in D$ there exists $s \in S$ and $t \in T$ such that $F_1(d) = (s,t)$
and $F_2(d,t) = s$ and $(d,s) \in P$. If $F_1(d) = (s,t)$, then by definition $(d,s) \in P$. We must show
that $F_2(d,t) = s$. t is a permutation of $\{1, 2, \ldots, N\}$, so the first two conditions of Definition 4.3 are
satisfied. Furthermore, by Definition 4.2, $d_{t_i} = s_i$ for $i = 1, 2, \ldots, N$. Since $s \in S$, it is a non-decreasing
sequence, and thus the third condition of Definition 4.3 is satisfied. Therefore $F_2(d,t) = s$.

Part 2: We must show that for all $d \in D$ and all $t \in T$ either ($F_2(d,t) = s$ and $(d,s) \in P$)
or $F_2(d,t) = \text{error}$. Pick $d \in D$ with length N. Pick $t \in T$. The interesting case is when t is a
permutation of $\{1, 2, \ldots, N\}$. If not, then either t contains fewer than N integers or the first N
integers of t are not such a permutation; in either case $F_2(d,t) = \text{error}$. We may ignore the
possibility that t consists of such a permutation followed by more integers, since $F_2$ depends only
on the first N integers of t.

Examine the sequence $d_{t_1}, d_{t_2}, \ldots, d_{t_N}$. If there is an i such that $d_{t_i} > d_{t_{i+1}}$, then the third condition
of Definition 4.3 is violated so $F_2(d,t) = \text{error}$. Otherwise $F_2(d,t) = d_{t_1}, d_{t_2}, \ldots, d_{t_N}$. Furthermore,
this is a non-decreasing sequence, so it must be in S. Finally, since this sequence is a permutation
of d, $(d, F_2(d,t)) \in P$.

Therefore, both conditions of Definition 2.2 are satisfied, so $F_1$ and $F_2$ constitute a certification
trail solution to sorting. ∎

Note that we defined T as the set of all finite sequences of integers. We could have instead defined
T as the set of permutations of $\{1, 2, \ldots, N\}$ for all positive N. This would make the function $F_2$
“simpler”, in that it doesn’t have to verify that the certification trail consists of a permutation (it
would, however, have to verify that it consists of a permutation of the correct size). In this case,
checking that the trail t is indeed a permutation (i.e., actually in its domain) would be left to the
implementation of the function.

5 Certification Trails for Shortest Paths 

This classic problem has been examined extensively in the literature. Our approach is applied to 
a variant of the Dijkstra algorithm [11] as explicated in [10]. First we require some preliminary 
definitions. 

Definition 5.1 A graph $G = (V, E)$ consists of a vertex set $V$ and an edge set $E$. An edge is an
unordered pair of distinct vertices which we notate with the following style: $[v, w]$, and we say $v$ is
adjacent to $w$. A path in a graph from $v_1$ to $v_k$ is a sequence of vertices $v_1, v_2, \ldots, v_k$ such that
$[v_i, v_{i+1}]$ is an edge for $i \in \{1, \ldots, k-1\}$. Let $w$ be a real function defined on $E$. The length of a
path from $v_1$ to $v_k$ is the sum of $w([v_i, v_{i+1}])$ for each edge $[v_i, v_{i+1}]$ in the path.

Let $G = (V, E)$ be a graph and let $w$ be a positive rational valued function defined on $E$. Given
a vertex $v_1$ in $V$, find a set of shortest paths from $v_1$ to each other vertex in $V$. Note that since $w$
is positive on all edges, a shortest path must exist between any two vertices of a connected graph,
though it need not be unique.

Before we discuss the algorithm we must describe the properties of the principal data structure 
that are required. Since many different data structures can be used to implement the algorithm, we 
initially describe abstractly the data that can be stored by the data structure and the operations 
that can be used to manipulate this data. The data consists of a set of ordered pairs. The first 
element in these ordered pairs is referred to as the item number and the second element is called 
the item value or just value. Ordered pairs may be added and removed from the set, however, at 
all times the item numbers of distinct ordered pairs must be distinct. It is possible, though, for
multiple ordered pairs to have the same item value. In this paper the item numbers are integers
between 1 and n, inclusive. Our default convention is that i is an item number, x is a value and
h is a set of ordered pairs. A total ordering on the pairs of a set can be defined lexicographically
as follows: (i, x) < (i', x') iff x < x' or (x = x' and i < i'). Our data structure should support a
subset of the following operations.

member(i, h) returns a boolean value of true if h contains an ordered pair with item number i,
otherwise returns false.

insert(i, x, h) adds the ordered pair (i, x) to the set h.

delete(i, h) deletes the unique ordered pair with item number i from h.

changekey(i, x, h) is executed only when there is an ordered pair with item number i in h. This
pair is replaced by (i, x).

deletemin(h) returns the ordered pair which is smallest according to the total order defined above
and deletes this pair. If h is the empty set then the token “empty” is returned.

predecessor(i, h) returns the item number of the ordered pair which immediately precedes the pair
with item number i in the total order. If there is no predecessor then the token “smallest” is
returned.
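As a small illustration of the total order used by these operations, the following Python sketch compares pairs by value first and by item number second. The helper names `order_key` and `smallest`, and the sample pairs, are assumptions made for this example only.

```python
from typing import List, Tuple

Pair = Tuple[int, float]  # (item number, value)


def order_key(pair: Pair) -> Tuple[float, int]:
    """Key realising (i, x) < (i', x') iff x < x' or (x == x' and i < i')."""
    i, x = pair
    return (x, i)


def smallest(pairs: List[Pair]) -> Pair:
    """The pair that a deletemin operation would return (pairs must be non-empty)."""
    return min(pairs, key=order_key)


sample = [(4, 130), (3, 60), (5, 62), (7, 60)]
ordered = sorted(sample, key=order_key)
```

Here the two pairs with value 60 are ordered by their item numbers, so (3, 60) precedes (7, 60).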

A description such as the one above describes an abstract data type (ADT). There may be several
possible implementations for a particular ADT. In our solution, different ADT implementations
will be used for the two executions. The first implementation will produce a certification trail
allowing the second implementation to be simpler and to perform ADT operations more quickly.

Aside from the implementation of the abstract data type, both of our algorithms are the same.
Pidgin code for this algorithm appears below. Figure 3 illustrates the execution of the algorithm
on a sample graph. Table 1 records the data structure operations performed when the algorithm
is run on the sample graph. The first column gives the operations, with the parameter h omitted
to reduce clutter. Member operations are also omitted from the table. The second column gives the
contents of h after the execution of each instruction. The third column records the ordered pair
deleted by deletemin operations. The fourth column records the information (if any) output to the
certification trail by this operation.

This certification trail is created by modifying the insert(i, x, h) and changekey(i, x, h) operations
performed during the first execution. The modified instructions perform the same operations
described above and in addition output the following information to the certification trail.

insert(i, x, h) Output the item number of the predecessor of (i, x) (as defined above) to the trail.
If there is no predecessor, output the token “smallest”. Note that depending on the data
structure implementation, the predecessor may already be computed during insertion or may
require a separate call to the predecessor(i, h) operation.

changekey(i, x, h) Output the item number of the predecessor of the ordered pair (i, x) (i.e., the pair
resulting from the change) to the trail. If there is no predecessor, output the token “smallest” to the trail.

We shall see that this information allows a faster and simpler data structure implementation to be 
used for our second algorithm. 
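The trail-generating operations can be sketched in Python as follows. This is a minimal illustration and not the report's implementation: a sorted Python list searched with `bisect` stands in for a balanced search tree (so updates here cost O(n) rather than O(log n)), and the names `TrailGeneratingSet`, `replay_table1` and the `trail` attribute are assumptions made for the example. The parameter h is dropped because the object itself plays that role.

```python
import bisect

SMALLEST = "smallest"
EMPTY = "empty"


class TrailGeneratingSet:
    """Set of (item, value) pairs; insert and changekey append to self.trail."""

    def __init__(self, n):
        self.keys = []                  # sorted list of (value, item) tuples
        self.value = [None] * (n + 1)   # value[i] is None when item i is absent
        self.trail = []                 # certification trail produced so far

    def member(self, i):
        return self.value[i] is not None

    def insert(self, i, x):
        if self.member(i):
            raise ValueError("item number already in use")
        pos = bisect.bisect_left(self.keys, (x, i))
        self.keys.insert(pos, (x, i))
        self.value[i] = x
        # Trail entry: item number of the predecessor of (i, x), if any.
        self.trail.append(self.keys[pos - 1][1] if pos > 0 else SMALLEST)

    def delete(self, i):
        if not self.member(i):
            raise KeyError("item number not in use")
        pos = bisect.bisect_left(self.keys, (self.value[i], i))
        del self.keys[pos]
        self.value[i] = None

    def changekey(self, i, x):
        # Replace the pair for item i; the insert emits the trail entry.
        self.delete(i)
        self.insert(i, x)

    def deletemin(self):
        if not self.keys:
            return EMPTY
        x, i = self.keys.pop(0)
        self.value[i] = None
        return (i, x)


def replay_table1(h):
    """Apply the operation sequence of Table 1 and collect the deletemin results."""
    out = []
    h.insert(2, 50); h.insert(3, 60); out.append(h.deletemin())
    h.insert(4, 130); h.insert(5, 62); out.append(h.deletemin())
    h.changekey(4, 103); out.append(h.deletemin())
    h.changekey(4, 94); h.insert(6, 72)
    out += [h.deletemin(), h.deletemin(), h.deletemin()]
    return out


first = TrailGeneratingSet(6)
answers = replay_table1(first)
```

Replaying the operations of Table 1 in this way leaves `first.trail` holding the entries listed in the table's Trail column.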

The algorithm proceeds by maintaining a set S of vertices for which shortest path lengths are
known, and a “frontier” set F of vertices adjacent to members of S along with the best known path
length from $v_1$. At each step, we find the vertex v in F with smallest known path length and place
it in S. F is then updated by examining the neighbors of v. New vertices may be added to F or a
shorter path (passing through v) may be found to existing vertices in F.

To efficiently find the vertex to add to S, the algorithm uses the data structure operations
described above. As soon as a vertex v is adjacent to some vertex u in S, it is inserted in the set
F. The value for v is the shortest known path to v, which is the value of u (shortest path to u)
plus the weight of edge [u, v]. The array element prefer(v) is used to keep track of this “best” edge
connecting v to S. As the tree grows, information is updated by operations such as insert(i, x, h)
and changekey(i, x, h). The deletemin(h) operation is used to select the next vertex to add to the
span of the current tree. Note that the algorithm does not explicitly store paths. Implicitly, however,
if (v, x) is returned by deletemin, then prefer(v) indicates the predecessor of v on the shortest path
from $v_1$.

Algorithm SHORTEST-PATH(G, v_1, weight)

Input: Connected graph G = (V, E) where V = {1, ..., n}, with edge weights.

Output: Lengths of shortest paths from v_1 to all other vertices.

 1  FOR ALL v ∈ V, dist(v) := ∞ END FOR
 2  dist(v_1) := 0
 3  F := {v_1}
 4  WHILE F ≠ ∅ DO
 5      (v, k) := deletemin(F)
 6      FOREACH [v, w] ∈ E DO
 7          IF dist(v) + weight([v, w]) < dist(w) THEN
 8              dist(w) := dist(v) + weight([v, w]); prefer(w) := v
 9              IF member(w, F) THEN changekey(w, dist(w), F)
10              ELSE insert(w, dist(w), F) END IF
11          END IF
12      END FOR
13  END WHILE
14  FOR ALL v ∈ V - {v_1}, OUTPUT(dist(v)) END FOR
END SHORTEST-PATH

Note that this code may be easily modified to output the shortest paths as well as their lengths. 
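For readers who prefer executable code, the pidgin code above can be sketched in Python. Several things here are assumptions rather than part of the report: `SimpleSet` is a placeholder ADT (a dictionary with a linear-time deletemin) that merely logs its operations, the array name `dist` is chosen for the example, and `EDGES` is a hypothetical graph constructed so that the logged operations match Table 1 (it is not necessarily the graph of Figure 3). The source vertex is scanned directly before the first deletemin, and the loop stops when deletemin returns “empty”, which is how Table 1 reads.

```python
import math

EMPTY = "empty"


class SimpleSet:
    """Placeholder ADT: a dict from item number to value that logs each operation."""

    def __init__(self):
        self.items = {}
        self.log = []

    def member(self, i):
        return i in self.items

    def insert(self, i, x):
        self.items[i] = x
        self.log.append(("insert", i, x))

    def changekey(self, i, x):
        self.items[i] = x
        self.log.append(("changekey", i, x))

    def deletemin(self):
        if not self.items:
            self.log.append(("deletemin", EMPTY))
            return EMPTY
        # Smallest pair under the total order: value first, then item number.
        i = min(self.items, key=lambda k: (self.items[k], k))
        x = self.items.pop(i)
        self.log.append(("deletemin", i, x))
        return (i, x)


def shortest_path(n, edges, source, h):
    """Return (dist, prefer) for vertices 1..n; index 0 is unused."""
    adj = {v: [] for v in range(1, n + 1)}
    for u, v, wt in edges:          # undirected edges [u, v] with weight wt
        adj[u].append((v, wt))
        adj[v].append((u, wt))
    dist = [math.inf] * (n + 1)
    prefer = [None] * (n + 1)
    dist[source] = 0
    v = source
    while True:
        for w, wt in adj[v]:
            if dist[v] + wt < dist[w]:
                dist[w] = dist[v] + wt
                prefer[w] = v
                if h.member(w):
                    h.changekey(w, dist[w])
                else:
                    h.insert(w, dist[w])
        result = h.deletemin()
        if result == EMPTY:
            break
        v, _ = result
    return dist, prefer


# Hypothetical sample graph, chosen to reproduce the operations of Table 1.
EDGES = [(1, 2, 50), (1, 3, 60), (2, 4, 80), (2, 5, 12),
         (3, 4, 43), (4, 5, 32), (5, 6, 10)]
h = SimpleSet()
dist, prefer = shortest_path(6, EDGES, 1, h)
```

Because `shortest_path` only calls member, insert, changekey and deletemin, any object offering those four methods can be passed as `h`, which is the property the two executions rely on.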

First execution: In this execution the SHORTEST-PATH code is used and the abstract data
type is implemented with a balanced search tree such as an AVL tree [1], a red-black tree [14], or
a b-tree [5]. In addition, an array indexed from 1 to n is used. Each element of this array contains
two fields: InSet, a boolean, and Value, storing the same type as the value used in the ordered
pairs. Initially, InSet is false for all array elements. The balanced search tree stores the ordered
pairs in h and is based on the total order described earlier. For each item number i, the InSet field
of the i-th array element is true if and only if there is a pair with item number i in the set. The
Value field of the i-th array element stores the value of the pair with item number i, if there is one
in the set. It is undefined if there is no such pair in the set. This array allows rapid execution of
operations such as member(i, h) and delete(i, h).

Operation          Set of Ordered Pairs       Delete     Trail
----------------   ------------------------   --------   --------
insert(2,50)       (2,50)                                smallest
insert(3,60)       (2,50),(3,60)                         2
deletemin          (3,60)                     (2,50)
insert(4,130)      (3,60),(4,130)                        3
insert(5,62)       (3,60),(5,62),(4,130)                 3
deletemin          (5,62),(4,130)             (3,60)
changekey(4,103)   (5,62),(4,103)                        5
deletemin          (4,103)                    (5,62)
changekey(4,94)    (4,94)                                smallest
insert(6,72)       (6,72),(4,94)                         smallest
deletemin          (4,94)                     (6,72)
deletemin                                     (4,94)
deletemin                                     empty

Table 1: Example of operations and trail.

Second execution: This execution also uses the SHORTEST-PATH code; however, a different
data structure is used to implement the ADT. We call this data structure an indexed linked list
and it is depicted in Figure 5. It consists of an array and a doubly linked list. The array is indexed
from 0 to n and contains pointers to the elements of the linked list. Except for the first element,
each element in the list contains a data field storing an ordered pair. The first element stores a
special ordered pair (0, “smallest”) which is guaranteed to compare less than any other ordered
pair. The list is maintained in sorted order based on the total ordering defined above for ordered
pairs. This list represents the contents of the set h. The i-th element of the array points to the node
containing the ordered pair with item number i, if such an element is present in h. Otherwise the
pointer is nil. The 0-th element of the array points to the node containing (0, “smallest”). Initially,
all pointers are nil except for the 0-th one. Using an ordered list allows us to perform deletemin(h)
operations quickly. The array provides rapid random access to the elements. We now describe the
implementation of the data structure operations.

insert(i, x, h) Read the next value from the certification trail. This value, call it j, is the item
number of the ordered pair that will be the predecessor of (i, x) after it is inserted. To
insert this element, we follow the j-th array pointer to the list node containing the pair (j, y).
There is one special case: if “smallest” is read from the trail rather than an item number,
we follow the 0-th pointer. A new node is allocated and inserted into the list just after the
node containing (j, y). The data field of this node is set to (i, x). Finally, the i-th pointer is
set to point to the new node. Figure 5 shows the insertion of (5,62) into the data structure,
given that the next item on the certification trail is 3. When the insert(i, x, h) operation is
performed, some checks must be conducted:

i. The i-th array element must be nil before the operation is performed.

ii. The value j read from the trail must either be “smallest” or be between 1 and n, i.e., it
must be a valid item number.

iii. The j-th array element must not be nil before the operation is performed.

iv. The sorted order of the pairs stored in the linked list must be maintained. That is,
if the j-th pointer points to (j, y) and its successor before the insertion (ignoring the
special case when (j, y) is the last element of the list) is (j', y'), then we must have
(j, y) < (i, x) < (j', y').

If any of these checks fails, then the execution halts and “error” is output.

delete(i, h) If the i-th pointer is nil, halt execution and output “error”. Otherwise follow the i-th
pointer to find the list node containing (i, x). This node is removed from the list. Note that
since the list is doubly linked, this is a constant time operation. The i-th pointer is then set
to nil. The only condition that must be checked is that the i-th pointer is not nil before the
deletion.

changekey(i, x, h) To perform this operation, it suffices to perform delete(i, h) followed by insert(i, x, h).
The next item from the certification trail is read when the insert(i, x, h) operation is performed. If
any of the conditions required by either of these operations fails, then execution halts and
“error” is output.

deletemin(h) The 0-th array pointer is traversed to the list head (which contains (0, “smallest”)).
The pointer to the next node in the list is followed. If there is no next node then “empty” is
returned. Otherwise, let (i, x) be the pair stored in that node. We remove the node from the
list, set the i-th array element to nil, and return (i, x).

member(i, h) The i-th array pointer is examined. “False” is returned if it is nil, otherwise “true”
is returned.

predecessor(i, h) This operation is not used during the second execution of SHORTEST-PATH,
but is described for completeness. Follow the i-th pointer to the node containing the pair
(i, x). Follow the pointer from that node to the node preceding it on the list (note that this
node will always exist). If this is the special node (0, “smallest”), return “smallest”, otherwise
return the item number of the pair stored in that node.
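The operations described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the report's code: the names `IndexedLinkedList`, `Node`, `TrailError` and `replay_table1` are invented for the example, raising `TrailError` models "execution halts and “error” is output", and the predecessor operation is omitted because the second execution does not use it.

```python
SMALLEST = "smallest"
EMPTY = "empty"


class TrailError(Exception):
    """Models the point at which execution halts and "error" is output."""


class Node:
    def __init__(self, item, value):
        self.item, self.value = item, value
        self.prev = self.next = None


class IndexedLinkedList:
    def __init__(self, n, trail):
        self.n = n
        self.trail = iter(trail)
        self.head = Node(0, SMALLEST)     # special pair (0, "smallest")
        self.ptr = [None] * (n + 1)       # ptr[i] is the node of item i, or None (nil)
        self.ptr[0] = self.head

    def _check(self, ok):
        if not ok:
            raise TrailError("error")

    def member(self, i):
        return self.ptr[i] is not None

    def insert(self, i, x):
        self._check(1 <= i <= self.n and self.ptr[i] is None)     # check i
        j = next(self.trail, None)        # None when the trail is exhausted
        if j == SMALLEST:
            j = 0
        else:
            self._check(type(j) is int and 1 <= j <= self.n)      # check ii
        pred = self.ptr[j]
        self._check(pred is not None)                              # check iii
        succ = pred.next
        # check iv: (j, y) < (i, x) < successor, comparing value first, then item
        self._check(pred is self.head or (pred.value, pred.item) < (x, i))
        self._check(succ is None or (x, i) < (succ.value, succ.item))
        node = Node(i, x)
        node.prev, node.next = pred, succ
        pred.next = node
        if succ is not None:
            succ.prev = node
        self.ptr[i] = node

    def delete(self, i):
        self._check(1 <= i <= self.n and self.ptr[i] is not None)
        node = self.ptr[i]
        node.prev.next = node.next
        if node.next is not None:
            node.next.prev = node.prev
        self.ptr[i] = None

    def changekey(self, i, x):
        self.delete(i)
        self.insert(i, x)

    def deletemin(self):
        node = self.head.next
        if node is None:
            return EMPTY
        self.delete(node.item)
        return (node.item, node.value)


def replay_table1(trail):
    """Run the operations of Table 1 against a trail; return answers or "error"."""
    h = IndexedLinkedList(6, trail)
    out = []
    try:
        h.insert(2, 50); h.insert(3, 60); out.append(h.deletemin())
        h.insert(4, 130); h.insert(5, 62); out.append(h.deletemin())
        h.changekey(4, 103); out.append(h.deletemin())
        h.changekey(4, 94); h.insert(6, 72)
        out += [h.deletemin(), h.deletemin(), h.deletemin()]
    except TrailError:
        return "error"
    return out


GOOD_TRAIL = ["smallest", 2, 3, 3, 5, "smallest", "smallest"]
```

With the trail of Table 1 every operation runs in constant time and returns the expected answers; a trail that is altered or cut short trips one of the checks instead of producing a wrong answer.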


There are two variations to this scheme that are worth noting. First, we could implement a
singly linked list rather than a doubly linked list. This eliminates the overhead of maintaining the
extra pointer. Note, however, that operations such as delete(i, h) require access to predecessors in
order to update the list quickly. This can be provided by modifying the operations delete(i, h),
changekey(i, x, h), and predecessor(i, h) so that they output predecessor information to the trail.
The other variation also uses a singly linked list but removes the need for extra certification trail
information for delete(i, h) and changekey(i, x, h) operations. It uses the technique of marking a
list node for deletion rather than removing it from the list immediately (the appropriate
pointer in the array is still set to nil immediately). When performing other operations, we check
for and remove any marked nodes immediately following nodes visited. The total running time is
still linear, though insert operations are no longer constant time operations.

Time complexity: In the first execution each data structure operation can be performed in
O(log(n)) time where |V| = n. There are at most O(m) such operations and O(m) additional time
overhead where |E| = m. Thus, the first execution can be performed in O(m log(n)) time. In addition,
it provides us with a relatively simple and illustrative example of the use of a certification trail.

In the second execution each data structure operation can be performed in O(1) time. There are still
at most O(m) such operations and O(m) additional time overhead. Hence, the second execution
can be performed in O(m) time, i.e., linear time.

Section 6 contains results of timing experiments with this technique.



5.1 Proof of correctness 

We wish to prove that the two algorithms given above constitute a certification trail solution to the
SHORTEST-PATH problem, i.e., that the functions $F_1(d)$ and $F_2(d,t)$ defined by these algorithms
satisfy Definition 2.2. First, we consider the problem of evaluating a sequence of the above data
structure operations.

Definition 5.2 Let $D$ be the set of finite sequences of the data structure operations defined above.
Let $S$ be the set of finite sequences of answers to data structure operations. Let $P$ be the relation
$(d, s)$ where $d \in D$ and $s \in S$, and $s$ is the sequence of answers resulting from executing the
operations $d$ starting with the empty set.

Note that we are examining all finite sequences of data structure operations, not just “legal”
ones. That is, a sequence may attempt to perform an insertion with an item number already in use, attempt
to perform a deletion on an item number not being used, etc. We assume that if one of these “illegal”
operations is attempted, the operation will output “error” and terminate processing. Thus, we can
define the answer sequences for these “illegal” sequences.

Definition 5.3 Let $F_1(d)$ be defined by the result of executing the operations on any of the standard
data structures described above, with the insert(i, x, h) and changekey(i, x, h) operations modified
to output trail information. Let $F_2(d,t)$ be defined by the result of executing the operations
using the indexed linked list implementation described above.

Theorem 5.4 $F_1(d)$ and $F_2(d,t)$ meet the conditions of Definition 2.2 (that is, $F_1(d)$ and $F_2(d,t)$
constitute a certification trail solution for $P$).

Proof: We must prove that both conditions of Definition 2.2 are satisfied by these functions. 

Part 1: The first condition we must verify is that for all $d \in D$ there exists $s \in S$ and there
exists $t \in T$ such that $F_1(d) = (s,t)$ and $F_2(d,t) = s$ and $(d,s) \in P$. Let $(s,t) = F_1(d)$. The
modifications of the data structure operations that produce trail output do not affect how the data
structure is maintained. Proofs of correctness for the standard data structures are well known, so
we may assume $(d,s) \in P$. We must demonstrate that $F_2(d,t) = s$.

This may be proven by showing that after each operation that modifies the set h, the elements
stored in the indexed linked list (our implementation) correspond to the elements in the set h (the
abstract definition). We must also demonstrate that if this relationship is maintained, then correct
output is generated by operations that generate output.

To demonstrate this, we show that each operation maintains the following invariants. 

i. If the pair (i, x) is in h ∪ {(0, “smallest”)}, then the i-th pointer in the array of pointers points
to the list node containing (i, x).

ii. If, for some i, there is no pair in h with item number i then the i-th pointer is nil.

iii. The list nodes are in ascending order. 

iv. Every list node is pointed to by some pointer in the array. (Together with the first condition, 
this implies that it is pointed to by exactly one pointer from the array). 

The first two conditions assert that the indexed linked list and the set h contain the same 
elements (ignoring the special list head element in the linked list). The last two invariants allow us 
to demonstrate that the linked list operations function correctly. 




Clearly each of these conditions is true before the first operation is performed (the set of pairs
is empty, all pointers except the 0-th are nil, and (0, “smallest”) is the only list node).

Assume that the above conditions are satisfied after the first k operations, and that the output
generated by any of the first k operations is correct. We claim that the invariants will remain
satisfied after the (k+1)-st operation, and that if the (k+1)-st operation generates output, it will be
correct. Let $s(k+1)$ denote the output produced by the (k+1)-st operation (where $F_1(d) = (s,t)$).

Consider each possible operation. For brevity, we omit details for “illegal” operations, i.e., those 
that violate the precondition of the operation. Similarly, we omit details of the special case of 
“smallest” being read from the trail. 

insert(i, x, h) The trail t contains the item number j of the predecessor of (i, x). Call the predecessor
(j, y). By assumption, the i-th pointer is nil before the insert. If not, this operation outputs
“error” and execution halts. Since the indexed linked list correctly represents h at this point,
this agrees with the result returned by $F_1(d)$, i.e., $s(k+1)$ = “error”. After the insertion is
performed, the i-th pointer is set to the new node containing (i, x), so the first condition is
satisfied. No other nodes are added to the list, so the second condition will remain true. The
third condition is satisfied since (j, y) is now the immediate predecessor of (i, x). Since no
other pointer in the array has been changed, the fourth condition is still true.

delete(i, h) This operation sets the i-th pointer to nil, and removes the node containing (i, x)
from the list. This satisfies the second invariant. Deleting a node cannot violate the third
invariant. Since no other nodes are removed and no other pointers are changed, the first and
fourth invariants remain satisfied.

deletemin(h) By assumption, the nodes are currently in ascending order. Thus, the minimum
element in h must correspond to the node following the special list head node; call the pair it
contains (i, x). This pair is the correct output for this operation. As with delete, the above
four conditions remain true after this node is removed and the i-th pointer set to nil.

changekey(i, x, h) We have implemented changekey(i, x, h) as a deletion followed by an insertion.
Since both of those preserve the invariants, changekey(i, x, h) must do so as well.

member(i, h) By assumption, the indexed linked list correctly represents h before this operation,
so the output of this operation will be correct. Since this operation does not change the set
or the indexed linked list, the invariants remain satisfied.

predecessor(i, h) By assumption, the indexed linked list correctly represents h, and furthermore it is
currently in sorted order. Thus, the list element preceding the node containing (i, x) is the
predecessor. Since this operation changes neither h nor the indexed linked list, the invariants
remain satisfied.

This demonstrates that the first condition of Definition 2.2 is satisfied. 

Part 2: The second condition is that for all $d \in D$ and for all $t \in T$ either ($F_2(d,t) = s$ and
$(d,s) \in P$) or $F_2(d,t) = \text{error}$. Intuitively, this states that if $F_2(d,t)$ is passed an arbitrary trail, it
either outputs a correct answer, or it outputs “error”. We prove an even stronger condition. Let
$t_{correct}$ be the trail returned by $F_1(d)$, i.e., $F_1(d) = (s, t_{correct})$. Then either $t_{correct}$ is a prefix of $t$,
or $F_2(d,t) = \text{error}$.

If $t_{correct}$ is a prefix of $t$, then we are done. The algorithm describing $F_2(d,t)$ does not examine
any part of the trail after $t_{correct}$, so $F_2(d,t) = s$.





If $t_{correct}$ is not a prefix of $t$, let $p$ be the position at which they first differ. Let $o$ be the number
of the operation that uses the trail data at $p$. Then operation $o$ is either an insert(i, x, h) or a
changekey(i, x, h) operation. If it is an insert operation, then $t_{correct}$ contains, at position $p$, the item number of
the predecessor of (i, x). Since $t$ contains a different value, call it j, at this location, the insert(i, x, h)
operation will fail one of its checks. Either j will not be a valid item number, or the j-th
pointer will be nil, or the pair (j, y) will not be the predecessor of (i, x). The argument for the
changekey(i, x, h) operation is essentially the same.

Thus, the second condition is satisfied. 

Therefore, $F_1(d)$ and $F_2(d,t)$ are a certification trail solution to $P$, the problem of evaluating
data structure operations. ∎

Definition 5.5 Let $D$ be the set of finite graphs $G = (V, E)$ with edge weights consisting of positive
integers. Assume the vertices are numbered 1 through n. Let $S$ be the set of finite ordered tuples
of positive integers. Let $P$ be the relation that associates each graph with the tuple consisting of
the minimum path lengths to each vertex. Let $SP_1(d)$ be the function defined by the SHORTEST-PATH
algorithm with the data structure defined for the first execution. Let $SP_2(d,t)$ be the function
defined by the SHORTEST-PATH algorithm using the indexed linked list implementation.

Corollary 5.6 $SP_1(d)$ and $SP_2(d,t)$ constitute a certification trail solution for $P$.

Proof: If $SP_1(d) = (s,t)$, then the correctness of Dijkstra’s algorithm implies that $(d,s) \in P$.
The algorithms that compute $SP_1(d)$ and $SP_2(d,t)$ are the same except for data structure
implementation. Theorem 5.4 implies that if these algorithms generate the same data structure
operations, then the same sequence of answers will be generated. Thus, to demonstrate that
$SP_2(d,t) = s$, it must be shown that the same sequence of data structure operations is generated
by both algorithms. Examination of SHORTEST-PATH indicates that the k-th data structure
operation to be performed is dependent only on the input and the result of previous data structure
operations. For example, at lines 9 and 10, either an insert(i, x, h) or a changekey(i, x, h) is performed,
depending on the result of a member(i, h) operation. The input graph d is identical for both
algorithms, thus the first data structure operation performed must be the same. Assume that the
first k operations performed by both algorithms are identical. Then, by Theorem 5.4, the answers
to those operations will be the same. Since the (k+1)-st operation depends only on the input and
the results of the previous k operations, it must also be the same for both algorithms. Therefore
the same sequence of data operations is performed in both algorithms, so $SP_2(d,t) = s$.

The proof that the second condition holds is the same as for Theorem 5.4. Either the input trail
t contains the “correct” trail as a prefix, or one of the data structure operations will fail, resulting
in an “error” output. ∎

One point has been glossed over in the above proof. In the SHORTEST-PATH algorithm the results
of deletemin(h) are not output, nor are they stored in the certification trail. It might be possible for
incorrect answers to be returned by deletemin(h) operations while still producing correct shortest
paths and lengths. The second execution of the SHORTEST-PATH algorithm will not detect this
since the correct output is produced. By proving that the answers to deletemin(h) operations are
the same, we have proven more than strictly required.

6 Experimental Data on Certification Trails 

We have performed extensive timing experiments on several basic and well-known problems, including
the ones described in this paper. Algorithms for solving these problems were implemented, both
with and without the use of certification trails. Timing data was collected on both the certification
trail solutions and the basic solutions. The following tables summarize these results.


Size      Basic Algorithm   First Execution           Second Execution   Speedup   Percent Savings
                            (Also Generates Trail)    (Uses Trail)
-------   ---------------   -----------------------   ----------------   -------   ---------------
5000      0.61              0.62                      0.07               8.73      43.62
10000     1.33              1.34                      0.14               9.56      44.54
25000     3.68              3.68                      0.36               10.22     45.12
50000     7.68              7.74                      0.71               10.75     44.94
100000    16.23             16.30                     1.43               11.35     45.39
200000    33.93             34.37                     2.84               11.94     45.16

Table 2: Convex Hull


Size      Basic Algorithm   First Execution           Second Execution   Speedup       Percent Savings
                            (Also Generates Trail)    (Uses Trail)
-------   ---------------   -----------------------   ----------------   -----------   ---------------
10000     0.28              [illegible]               0.04               [illegible]   39.29
50000     1.80              1.90                      0.19               9.47          41.94
100000    3.96              4.08                      0.41               9.66          [illegible]
500000    23.95             24.69                     2.14               11.19         [illegible]
1000000   50.23             51.57                     4.38               11.47         44.31

Table 3: Sort


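Size         Basic Algorithm   First Execution           Second Execution   Speedup       Percent Savings
                               (Also Generates Trail)    (Uses Trail)
----------   ---------------   -----------------------   ----------------   -----------   ---------------
100,1000     [illegible]       0.05                      0.02               2.00          [illegible]
250,2500     [illegible]       0.16                      0.06               [illegible]   [illegible]
500,5000     0.31              0.33                      0.11               [illegible]   [illegible]
1000,10000   [illegible]       0.76                      0.23               3.04          29.29
2000,20000   1.58              1.67                      0.45               3.51          [illegible]
2500,25000   2.06              2.15                      0.55               3.75          [illegible]

Table 4: Shortest Path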


The timing information was gathered on a Sun SPARCstation ELC with 16MB of RAM. The
system was run as a standalone machine in single user mode during timing experiments.

Much of the data presented in the timing tables is essentially self-explanatory relative to the
certification trail technique and algorithms considered. However, a brief discussion of the table
entries is appropriate.

The column labelled Basic Algorithm contains timing data which gives the execution time of the
algorithm in producing the output without the generation of the certification trail. All timing data
is listed in seconds.



The First Execution column gives the execution time of the algorithm in producing the output 
with the additional overhead of generating the certification trail. 

The Second Execution column gives the execution time of the algorithm in producing the output 
while using the certification trail. 

The Speedup column is the ratio of the run times of the Basic Algorithm and the Second
Execution. One reason this figure is important is that it is possible for the two algorithms to run in
different environments (different hardware, programming language, etc.). A high speedup indicates
that less powerful hardware or a higher level language (with associated overhead) may be sufficient
for the second execution.

The Percent Savings column records the percentage of execution time saved
by using the certification trail method as compared to a 2-version programming approach. The time
required for a 2-version programming approach was estimated by doubling the time reported in the
Basic Algorithm column. This assumes that both versions take approximately the same amount of time to
execute.
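The two derived columns can be recomputed with a few lines of Python. The formulas below are inferred from the column descriptions (speedup as basic time over second-execution time, savings relative to twice the basic time) and the function names are assumptions made for this sketch. The formulas take the cost of the certification trail method to be the sum of the first and second execution times. Because the tables print times rounded to two decimals, recomputed values can differ from the printed ones in the last digit for some rows.

```python
def speedup(basic: float, second: float) -> float:
    """Ratio of the Basic Algorithm time to the Second Execution time."""
    return basic / second


def percent_savings(basic: float, first: float, second: float) -> float:
    """Time saved by first + second execution relative to 2-version programming."""
    two_version = 2.0 * basic          # estimate: both versions take the basic time
    trail_method = first + second      # certification trail method: both executions
    return 100.0 * (two_version - trail_method) / two_version


# Last row of Table 3 (Sort, size 1000000): 50.23, 51.57 and 4.38 seconds.
sort_speedup = round(speedup(50.23, 4.38), 2)
sort_savings = round(percent_savings(50.23, 51.57, 4.38), 2)
```

For this row the recomputed figures agree with the printed Speedup and Percent Savings entries.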

In addition to the tables, the timing information for the convex hull algorithm is plotted in 
Figure 5. Plots for the other two examples are similar. 

Examination of the data collected for the convex hull algorithm indicates that: 

• The overhead in generating the certification trail is very small, less than 2% of the running
time of the basic (no certification trail) algorithm.

• The second execution is very fast, achieving an order of magnitude speedup for larger input
sizes. This suggests that a single “second algorithm” process could easily handle the output
generated by several “first algorithm” processes running in parallel. Alternatively, the high
speedup would allow the second execution to be run on lower performance (and hence less
expensive) hardware. Finally, the large speedup and reduced code complexity may make it
possible to take advantage of a formally verifiable language (which may require significant
overhead) in implementing the second algorithm.

The data for sorting indicates that the certification trail also requires very low overhead and 
results in a large speedup. For the shortest path problem the overhead is still very low, and the 
speedup, while not as dramatic as for the first two problems, is still quite respectable. 

7 Comparison With Other Techniques 

The certification trail approach shares similarities with other valuable fault tolerance and fault 
detection techniques that have been previously proposed and examined, but in each case there are 
significant and fundamental distinctions. These distinctions are primarily related to the generation 
and character of the certification trail and the manner in which the secondary algorithm uses the 
certification trail. 

First consider the important and useful technique called N-version programming [9, 3]. When 
using this technique N different implementations of an algorithm are independently executed with 
subsequent comparison of the resulting N outputs. There is no relationship among the executions of 
the different versions of the algorithms other than that they all use the same input; each algorithm 
is executed independently without any information about the execution of the other algorithms. In 
marked contrast, the certification trail approach allows the primary algorithm to generate a trail 
of information which can be read by the secondary algorithm. The advantages of utilizing this 
additional information are shown in the body of this paper. In effect, N-version programming can 
be thought of relative to the certification trail approach as the employment of a null trail. 





Another valuable technique, known as the recovery block approach [2, 18, 21], was proposed by 
Randell. It uses acceptance tests and alternative procedures to produce what is to be regarded as 
a correct output from a program. When using recovery blocks, a program is viewed as being
structured into blocks of operations, which after execution yield outputs which can be tested in 
some informal sense for correctness. The rigor, completeness, and nature of the acceptance test 
is left to the program designer, and many of the acceptance tests that have been proposed for 
use tend to be somewhat straightforward [2]. When using certification trails it is clearly possible 
to combine the second execution and the comparison test to yield a program which certifies the 
correctness of the output of the first execution. Unlike an acceptance test this certifier must satisfy 
strict formal properties of correctness. Also note that the certification trail technique emphasizes 
the capability of generating additional data to ease the certifying process and does not rely solely 
on data which would normally be computed. It should be possible to fruitfully combine the ideas 
of recovery blocks and certification trails. 

Algorithm-based fault tolerance [15, 17, 19] uses error detecting and correcting codes for performing
reliable computations with specific algorithms. This technique encodes data at a high level and
algorithms are specifically designed or modified to operate on encoded data and produce encoded
output data. Algorithm-based fault tolerance is distinguished from other fault tolerance techniques
by three characteristics: the encoding of the data used by the algorithm; the modification of the 
algorithm to operate on the encoded data; and the distribution of the computation steps in the 
algorithm among computational units. The error detection capabilities of the algorithm-based fault 
tolerance approach are directly related to that of the error correction encoding utilized. The cer- 
tification trail approach does not require that the data to be executed be modified nor that the 
fundamental operations of the algorithm be changed to account for these modifications. Instead, 
only a trail indicative of aspects of the algorithm’s operations must be generated by the algorithm. 
As seen in Section 6, the production of this trail does not add significant overhead. Moreover, any 
combination of computational errors can be handled. 

Recently, Blum and Kannan [6] have defined what they call a program checker. This interesting 
work has been followed by a burst of activity in this general area [12, 7, 25, 8, 4]. Each of these 
papers, however, describes work which differs significantly from the work we present. A program 
checker is an algorithm which checks the output of another algorithm for correctness. An early 
example of a program checker is the algorithm developed by Tarjan [23] which takes as input a 
graph and a supposed minimum spanning tree and indicates whether or not the tree actually is a 
minimum spanning tree. 

The Blum-Kannan program checking method differs from the certification trail method in two
important ways. First, the checker is designed to work for a problem and not a specific algorithm.
That is, the checker design is based on the input/output specification of a problem and no assumptions
are made about the method being used to solve the problem. Because of this the algorithm
which is being checked is treated as a black box. It cannot be altered nor can its internal status
be examined and exploited. In the certification trail approach the algorithm being checked is not
treated as a black box. Instead, the algorithm can be modified to generate additional information
(i.e., the certification trail) which is considered to be useful in the checking/verification process. By
exploiting this capability it is sometimes possible to design certification trail solutions which allow 
faster checking than Blum-Kannan program checkers. Of course, these faster solutions are more 
specialized than the Blum-Kannan checkers which are guaranteed to work for any algorithm which 
solves the original problem. We believe that the added speed often outweighs the disadvantage of 
specialization. 

The second important difference concerns the number of times that the program which is being
checked is executed. In the Blum-Kannan approach the program may be invoked a polynomial
number of times. In the certification trail approach the program is run only once. Thus, the overall
time complexity of the checking process can be significantly larger for Blum-Kannan checkers.

A third less important difference stems from the fact that Blum-Kannan checkers are defined 
in a more general probabilistic context. Certification trails are currently defined only for deter- 
ministic programs and checkers. However, it is clearly possible to define them in the more general 
probabilistic context. 

Other work has been done to extend the ideas of Blum-Kannan to give methods which allow 
the conversion of some programs into new programs which are self-testing and self-correcting [12, 
7]. However, these methods are also based on treating programs as black boxes and thus have 
limitations similar to Blum-Kannan program checkers. A recent paper by Blum et al. [8] concerns 
checking the correctness of memories and data structures. The results described in that paper 
differ from our work using abstract data types in one central way. The checkers that they design
are tightly constrained in memory usage. Typically, they use only $O(\log n)$ storage to check data
structures of size $O(n)$. Our results do not place space constraints on the algorithm used to certify
the data structure. Without a space constraint we are able to certify abstract data types such as 
priority queues which are more complex than the data structures that they check, i.e., stacks and 
queues. Also, we are able to achieve a speed up in the checking process and they are not. 

Babai, Fortnow, Levin and Szegedy [4] present methods which appear to allow remarkably fast 
checking, i.e., in polylogarithmic time. Their approach has some similarities to the methods we 
propose. Both methods modify original algorithms to yield new algorithms which output additional 
information. We refer to this additional information as a certification trail and they refer to this
information as a witness. In our case we are interested in modified algorithms which have the same
asymptotic time complexity as the original algorithm. Indeed, the modified algorithm should be
slowed down by at most a factor of two. In [4] the modified algorithm is slowed down by more than
any fixed multiplicative factor. Specifically, if the original algorithm has a time complexity of $O(T)$
then the modified algorithm has a time complexity of $O(T^{1+\epsilon})$. Note that in practice the $\epsilon$ cannot
be too small because its inverse appears in the exponent of the checker time complexity. Another
difference between our methods is the fact that their method requires that the input and output
be encoded using an error-correcting code. The encoding process takes $O(N^{1+\epsilon})$ time for strings
of length $N$. However, many of the checkers we have developed take only linear time so the cost
of simply preparing to use their method appears to be too great in some cases. It is also necessary 
to decode the output after the check. Lastly, we note that Fortnow has stated that their result is 
currently not practical [24]. 

8 Generalization and Future Research Areas 

The experimental timing data on certification trails indicates that this technique is of great practical
value as well as of theoretical interest. Furthermore, the techniques illustrated are applicable to a 
wide range of problems, especially the certification of Abstract Data Types described in the shortest 
path example. There are many areas of interest for future exploration, a few of which are described 
below. 


8.1 Certified Data Structure Libraries 

It is apparent that the certification trail technique described for the SHORTEST-PATH program
may be used for a variety of problems. Since the certification trail is produced and used by abstract
data type operations, the technique may be used with any algorithm that can be implemented in
terms of those abstract data types. Creating a library of such “certified data types” enables
programmers to create fault tolerant programs without having to be familiar with the certification
trail technique. Object oriented programming appears to be well suited to this task.

A possible objection to this is that it provides fault detection only for the data structure
implementation, since the surrounding code is simply reused. Furthermore, the data structure
implementation is likely to come from library code, and hence be highly reliable. In answer to this,
note that:

• In many algorithms, the code using the data structure is much simpler than the code imple- 
menting the data structure. 

• Although the example above illustrated reuse of the code using the data structures, it is certainly
possible for this code to be developed separately for the first and second execution programs.

• Errors are often found even in code that has been in use for a long period of time. The added 
confidence of using this technique may be desirable even for library code. 

• Even if the library code is highly reliable, the certification trail can be helpful in detecting 
errors caused by hardware problems. 

• Library code may have to be tuned or even rewritten to meet the needs of a particular application
or environment, partially negating the claim of using well-tested code.

Even if fault detection is not an issue, the certification trail technique is useful during program 
testing and debugging. Input may be automatically generated and processed. If the output of the 
first and second executions differ or an error is otherwise flagged, the input set is flagged. This 
reduces the need to otherwise compute output for selected input and enables both more and larger 
sets of input to be processed. 2-version programming may be used during debugging in a similar
manner; however, certification trails have the advantage of reduced overhead, allowing more test
cases to be run, a reduction in the hardware required for testing, or both.
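
The testing procedure described above can be sketched as a small harness. This is only an illustration under stated assumptions: sorting is used as the problem, the trail is assumed to be the sorting permutation, and all function names (`sort_with_trail`, `buggy_sort_with_trail`, `check_sort`, `flag_failing_inputs`) are hypothetical and do not come from the report.

```python
import random

ERROR = "error"  # stands in for the error indication of the second execution


def sort_with_trail(data):
    """Hypothetical first execution: sorted output plus the permutation used."""
    trail = sorted(range(len(data)), key=lambda i: data[i])
    return [data[i] for i in trail], trail


def buggy_sort_with_trail(data):
    """Same interface, but deliberately wrong: it orders by absolute value."""
    trail = sorted(range(len(data)), key=lambda i: abs(data[i]))
    return [data[i] for i in trail], trail


def check_sort(data, trail):
    """Hypothetical second execution: accept the trail only if it sorts data."""
    n = len(data)
    # The trail must be a permutation of the positions 0..n-1.
    if len(trail) != n or set(trail) != set(range(n)):
        return ERROR
    output = [data[i] for i in trail]
    # The permuted data must be in nondecreasing order.
    if any(a > b for a, b in zip(output, output[1:])):
        return ERROR
    return output


def flag_failing_inputs(first, second, trials=200, seed=0):
    """Generate random inputs and collect those on which the two runs disagree."""
    rng = random.Random(seed)
    flagged = []
    for _ in range(trials):
        data = [rng.randint(-50, 50) for _ in range(rng.randint(0, 8))]
        output, trail = first(data)
        checked = second(data, trail)
        if checked == ERROR or checked != output:
            flagged.append(data)
    return flagged
```

Calling `flag_failing_inputs(buggy_sort_with_trail, check_sort)` returns the generated inputs on which the faulty first execution was caught, without any expected outputs having been prepared by hand.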

8.2 Almost-concurrent execution of the certification trail 

In the above discussion and examples, the certification trail programs have been executed serially,
i.e., we do not run the second execution until after the first execution has completed. Actually, except for
sorting, the two executions in the examples above can be run almost-concurrently. The “second”
execution simply reads the information from the certification trail as it becomes available. The two
programs will finish nearly simultaneously, the difference being in the time after the last element
is read from or written to the certification trail.
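
A minimal sketch of this arrangement follows, assuming a toy problem that is not one of the report's examples: computing the integer square root of each number in a list, with the claimed root itself serving as the trail element. The trail is modeled as a thread-safe queue, and the names `first_execution`, `second_execution`, and `run_almost_concurrently` are hypothetical. Python threads are used only to show the structure; they do not by themselves give a parallel speedup.

```python
import math
import queue
import threading

ERROR = "error"
DONE = object()  # sentinel written after the last trail element


def first_execution(numbers, trail, results):
    """Compute each answer and publish its trail element immediately."""
    for n in numbers:
        r = math.isqrt(n)      # stands in for the primary computation
        results.append(r)
        trail.put(r)           # made available to the second execution at once
    trail.put(DONE)


def second_execution(numbers, trail, results):
    """Consume trail elements as they arrive and certify each one."""
    for n in numbers:
        r = trail.get()        # blocks until the element has been written
        # Accept r only if it really is the integer square root of n.
        ok = isinstance(r, int) and r >= 0 and r * r <= n < (r + 1) * (r + 1)
        if not ok:
            results.append(ERROR)
            return
        results.append(r)


def run_almost_concurrently(numbers):
    """Start both executions together and compare their outputs at the end."""
    trail = queue.Queue()
    first_out, second_out = [], []
    t1 = threading.Thread(target=first_execution, args=(numbers, trail, first_out))
    t2 = threading.Thread(target=second_execution, args=(numbers, trail, second_out))
    t1.start()
    t2.start()
    t1.join()
    t2.join()
    if second_out and second_out[-1] == ERROR:
        return ERROR
    return first_out if first_out == second_out else ERROR
```

Because the second execution blocks only while waiting for the next trail element, it finishes shortly after the first execution writes its last element.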

8.3 Continuing after an error 

A possible extension to the use of certification trails is to attempt to continue the second execution 
after an error is detected. Consider the shortest path example using abstract data types. In 
that example, the second execution used an indexed linked list that performed each operation in
constant time by using the certification trail from the first execution. Suppose that an error had
been detected during the second execution. Rather than simply aborting, it may be possible to 
continue execution. This could be done by 

• Reorganizing the existing set into some other data structure (such as an AVL tree, red-black
tree, etc.) that allows efficient operation without a certification trail.




• Continuing to use the indexed linked list and ignoring the rest of the certification trail. Note 
that this would result in some operations requiring more time. 

• Continuing to use the indexed linked list and attempting to use the certification trail for future 
operations. This may be possible if the error that occurred has sufficiently “local” effect. For 
example, if part of a tree structure is corrupted during the first execution, it is still possible 
that operations involving other parts of the tree will be performed correctly. 

On a related topic, research has been done on “self-correcting” data structures in which enough 
redundancy is built into a data structure so that it may be reconstructed if part of it is corrupted. 
Using certification trails with such structures could provide an efficient detector for corruption of 
the data structure. 


References 

[1] Adel’son-Vel’skii, G. M., and Landis, E. M., “An algorithm for the organization of informa-
tion”, Soviet Math. Dokl., pp. 1259-1262, 3, 1962.

[2] Anderson, T., and Lee, P., Fault tolerance: principles and practices, Prentice-Hall, Englewood
Cliffs, NJ, 1981.

[3] Avizienis, A., “The N-version approach to fault tolerant software,” IEEE Trans. on Software
Engineering, vol. 11, pp. 1491-1501, Dec., 1985.

[4] Babai, L., Fortnow, L., Levin, L., and Szegedy, M., “Checking computations in polylogarithmic 
time, ” Proceedings of the 23rd ACM Symposium on Theory of Computing, pp. 21-31, 1991. 

[5] Bayer, R., and McCreight, E., “Organization of large ordered indexes”, Acta Inform., pp 
173-189, 1, 1972. 

[6] Blum, M., and Kannan, S., “Designing programs that check their work”. Proceedings of the 
1989 ACM Symposium on Theory of Computing, pp. 86-97, ACM Press, 1989. 

[7] Blum, M., Luby, M., and Rubinfeld, R., “Self-testing/correcting with applications to numerical
problems,” Proceedings of the 22nd ACM Symposium on Theory of Computing, pp. 73-83, 1990.

[8] Blum, M., Evans, W., Gemmell P., Kannan, S., and Naor, M., “Checking the correctness of 
memories,” Proceedings of the 32nd IEEE Symposium on Foundations of Computer Science 
pp. 90-99, 1991 

[9] Chen, L., and Avizienis A., “N-version programming: a fault tolerant approach to reliability of
software operation,” Digest of the 1978 Fault Tolerant Computing Symposium, pp. 3-9, IEEE
Computer Society Press, 1978.

[10] Cormen, T. H., and Leiserson, C. E., and Rivest, R. L., Introduction to Algorithms, McGraw-
Hill, New York, NY, 1990.

[11] Dijkstra, E. W., “A note on two problems in connexion with graphs,” Numer. Math. 1, pp.
269-271, Sept., 1959.

[12] Gemmell, P., Lipton, R., Rubinfeld, R., Sudan, M., and Wigderson, A., “Self-
testing/correcting for polynomials and for approximate functions,” Proceedings of the 23rd
ACM Symposium on Theory of Computing, pp. 32-42, 1991.

[13] Graham, R. L., “An efficient algorithm for determining the convex hull of a planar set”,
Information Processing Letters, pp. 132-133, 1, 1972.

[14] Guibas, L. J., and Sedgewick, R., “A dichromatic framework for balanced trees”, Proceedings 
of the Nineteenth Annual Symposium on Foundations of Computing, pp. 8-21, IEEE Computer 
Society Press, 1978. 

[15] Huang, K.-H., and Abraham, J., “Algorithm-based fault tolerance for matrix operations,”
IEEE Trans. on Computers, pp. 518-529, vol. C-33, June, 1984.

[16] Johnson, B., Design and analysis of fault tolerant digital systems Addison-Wesley, Reading 
MA, 1989. 

[17] Jou, J.-Y. and Abraham, J. “Fault tolerant FFT networks,” Dig. of the 1985 Fault Tolerant 
Computing Symposium, pp. 338-343, IEEE Computer Society Press, June, 1985. 

[18] Lee, Y.H. and Shin, K.G., “Design and evaluation of a fault-tolerant multiprocessor using
hardware recovery blocks,” IEEE Trans. Comput., vol. C-33, pp. 113-124, Feb. 1984.

[19] Nair, V., and Abraham, J., “General linear codes for fault-tolerant matrix operations on
processor arrays,” Dig. of the 1988 Fault Tolerant Computing Symposium, pp. 180-185, June,
1988.

[20] Preparata F. P., and Shamos M. I., Computational geometry: an introduction, Springer-Verlag,
New York, NY, 1985.

[21] Randell, B., “System structure for software fault tolerance,” IEEE Trans. on Software Engi-
neering, vol. 1, pp. 220-232, June, 1975.

[22] Siewiorek, D., and Swarz, R., The theory and practice of reliable design, Digital Press, Bedford 
MA, 1982. 

[23] Tarjan, R. E., “Applications of path compression on balanced trees”, J. ACM, pp 690-715 
Oct., 1979. 

[24] Paul Wallich, “Crunching Epsilon,” Scientific American, pp. 22-24, Jan., 1993 

[25] Andrew Chi-Chih Yao, “Coherent Functions and Program Checkers,” Proc. 22 ACM Symp. of 
Theory of Computing, pp. 84-94. 



Finally we discuss the work our group has performed on the
design and implementation of fault injection testbeds for experi-
mental analysis of the certification trail technique. This work em-
ploys two distinct methodologies: software fault injection (mod-
ification of instruction, data, and stack segments of programs on
a Sun Sparcstation ELC and on an IBM 386 PC) and hardware
fault injection (control, address, and data lines of a Motorola
MC68000-based target system pulsed at logical zero/one values).

Our results indicate the viability of the certification trail tech- 
nique. We also believe the tools we have developed provide a 
solid base for additional exploration. 

Keywords: Software fault tolerance, certification trails, error 
monitoring, design diversity, data structures. 

1 Introduction 

Certification trails are a recently introduced and promising approach to 
fault-detection and fault-tolerance [1, 3]. In this paper, we report on a com- 
prehensive attempt to assess experimentally the performance and overall 
value of the method. We have implemented several fundamental algorithms 
together with versions of the algorithms which generate and utilize certifica- 
tion trails. Specifically, algorithms for the following problems are analyzed: 
Huffman tree, shortest path, minimum spanning tree, sorting, and convex
hull. Our results reveal many cases in which an approach using certification 
trails allows for significantly faster overall program execution time than a 
basic time redundancy approach. 

We also examine algorithms for the answer-validation problem for ab- 
stract data types. This kind of problem was originally proposed in [3] and 
provides a basis for applying the certification-trail method to wide classes of 
algorithms. For this paper we implemented and analyzed answer-validation
solutions for two abstract data types. The first solution is for a simplified
priority queue which allows insert, min and deletemin operations, and the
second solution is for a priority queue which allows insert, min, delete and
deletemin operations. In both cases, the algorithm which performs answer-
validation is substantially faster than the original algorithm for computing the
answers.
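
To make the answer-validation idea concrete, here is a minimal sketch for the simplified priority queue (insert, min, deletemin). It rests on assumptions that are not stated in this section: items are distinct integers, and the trail holds, for each insert, the predecessor of the new item in the current set. The second execution keeps an indexed linked list (a dictionary mapping each item to its successor), so every operation takes constant expected time. The first execution uses a plain sorted list for brevity rather than an efficient priority queue. All names are hypothetical.

```python
import bisect

ERROR = "error"
NEG_INF, POS_INF = float("-inf"), float("inf")  # list sentinels


def first_execution(ops):
    """Answer each operation and record, per insert, the item's predecessor."""
    items, answers, trail = [], [], []
    for name, x in ops:
        if name == "insert":
            pos = bisect.bisect_left(items, x)
            trail.append(items[pos - 1] if pos > 0 else NEG_INF)
            items.insert(pos, x)
            answers.append(None)
        elif name == "min":
            answers.append(items[0] if items else None)
        else:  # "deletemin"
            answers.append(items.pop(0) if items else None)
    return answers, trail


def second_execution(ops, trail):
    """Recompute the answers with an indexed linked list guided by the trail."""
    nxt = {NEG_INF: POS_INF}  # item -> successor in sorted order
    answers = []
    hints = iter(trail)
    for name, x in ops:
        if name == "insert":
            pred = next(hints, None)
            # The claimed predecessor must be present and must bracket x.
            if pred not in nxt or not (pred < x < nxt[pred]):
                return ERROR
            nxt[x] = nxt[pred]
            nxt[pred] = x
            answers.append(None)
        else:  # "min" or "deletemin"
            head = nxt[NEG_INF]
            if head == POS_INF:
                answers.append(None)  # queue is empty
            else:
                answers.append(head)
                if name == "deletemin":
                    nxt[NEG_INF] = nxt.pop(head)
    return answers
```

Because an insert is accepted only when the hinted predecessor is in the list and the new item fits strictly between it and its successor, the list stays sorted whatever the trail contains; an incorrect trail can cause an error indication but, under the stated assumptions, not an incorrect answer.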

This paper next presents a simple probabilistic model and analysis which
enables comparison between the certification-trail method and the time-
redundancy approach. The analysis shows that when the certification-trail
method has a smaller execution time than the time-redundancy approach
it yields strictly superior performance. This means the method has both
a smaller probability of error and a smaller probability of undetected
error. Surprisingly, the analysis also reveals the intriguing result that the
certification-trail method often can display superior performance even when
the method has the same execution time or a longer execution time than the
time-redundancy approach. This superior behavior stems from the typical
asymmetry of the execution times of the first and second executions in the
certification-trail method.

The paper next discusses the work our group has performed on the design 
and implementation of fault injection testbeds. This work employs two 
distinct methodologies: software fault injection and hardware fault injection. 
The software fault injection tool is similar to an interactive debugger but 
more accurately can be considered an interactive bugger. It allows programs 
to be halted and faults to be injected by direct modification of the stack, 
data and instruction segments of a program. Output can then be captured 
and characterized. 

The hardware fault injector is based on injecting faults into an operating 
microprocessor. The injection is performed by explicitly setting one or more 
pins of the microprocessor to logical zero and/or logical one values. The 
timing and duration of the pin setting is under control of a supervisory 
processor. The testbed also includes a multi-processor system. This system 
consists of three processors which are connected to one another pairwise by 
shared banks of dual ported memory. We plan to use this system to conduct 
evaluation of systems which utilize concurrent execution of algorithms using 
the certification-trail method. 

2 Introduction to Certification Trails 

To explain the essence of the certification-trail technique for software fault 
tolerance, we will first discuss a simpler fault-tolerant software method. In 
this method the specification of a problem is given and an algorithm to solve 
it is constructed. This algorithm is executed on an input and the output is 
stored. Next, the same algorithm is executed again on the same input and 
the output is compared to the earlier output. If the outputs differ then an 
error is indicated, otherwise the output is accepted as correct. This software 
fault tolerance method requires additional time, so-called time redundancy
[32, 52]; however, it requires no additional software. It is particularly valu-
able for detecting errors caused by transient fault phenomena. If such faults
cause an error during only one of the executions then either the error will be
detected or the output will be correct. The second possibility, of undetected
faults, occurs when the output of the execution is unaffected by the faults.

A variation of the above method uses two separate algorithms, one for 
each execution, which have been written independently based on the problem 
specification. This technique, called N-version programming [16, 12] (in 
this case N=2), allows for the detection of errors caused by some faults 
in the software in addition to those caused by transient hardware faults and
utilizes both time and software redundancy. Errors caused by software faults 
are detected whenever the independently written programs do not generate 
coincident errors. 

The certification-trail technique is designed to obtain similar types of 
error-detection capabilities but expend fewer resources. The central idea, 
as illustrated in Figure 1, is to modify the first algorithm so that it leaves 
behind a trail of data which we call a certification trail. This data is chosen
so that it can allow the second algorithm to execute more quickly and/or
have a simpler structure than the first algorithm. As above, the outputs of
the two executions are compared and are considered correct only if they
agree. Note, however, we must be careful in defining this method or else
its error detection capability might be reduced by the introduction of data
dependency between the two algorithm executions. For example, suppose
the first algorithm execution contains an error which causes an incorrect
output and an incorrect trail of data to be generated. Further suppose
that no error occurs during the execution of the second algorithm. It still
appears possible that the execution of the second algorithm might use the
incorrect trail to generate an incorrect output which matches the incorrect
output given by the execution of the first algorithm. Intuitively, the second
execution would be “fooled” by the data left behind by the first execution.
The definitions we give below exclude this possibility. They demand that
the second execution either generate a correct answer or signal that an error
has been detected in the data trail.

3 Formal Definition of a Certification Trail 

In this section we will give a formal definition of a certification trail and 
discuss some aspects of its realizations and uses. 




N94- 36064 


Certification of Computational Results 

Gregory F. Sullivan¹

Dwight S. Wilson²

Gerald M. Masson³

Dept, of Computer Science, Johns Hopkins Univ., Baltimore, MD 21218 




Abstract 

We describe a conceptually novel and powerful technique to achieve fault detection
and fault tolerance in hardware and software systems. When used for software fault
detection, this new technique uses time and software redundancy and can be outlined as
follows. In the initial phase, a program is run to solve a problem and store the result.
In addition, this program leaves behind a trail of data which we call a certification trail.
In the second phase, another program is run which solves the original problem again.
This program, however, has access to the certification trail left by the first program.
Because of the availability of the certification trail, the second phase can be performed
by a less complex program and can execute more quickly. In the final phase, the two
results are compared and if they agree the results are accepted as correct; otherwise an
error is indicated. An essential aspect of this approach is that the second program must
always generate either an error indication or a correct output even when the certification
trail it receives from the first program is incorrect. We formalize the certification trail
approach to fault tolerance and illustrate realizations of it by considering algorithms
for the following problems: convex hull, sorting, and shortest path. We discuss cases in
which the second phase can be run concurrently with the first and act as a monitor. We
compare the certification trail approach to other approaches to fault tolerance.


Keywords: Software fault tolerance, error monitoring, design diversity, data structures. 


1 Introduction 

In this paper we describe a novel and powerful technique for achieving fault tolerance in systems. 
Although applicable to both hardware and software implementation, we restrict our discussion 
of this technique to implementation in software. To explain our technique, we will first discuss
a simpler method. In this method the specification of a problem is given and an algorithm to
solve it is constructed. This algorithm is executed on a particular input and the output is stored.
Next, the same algorithm is executed again on the same input and the output is compared to the
earlier output. If the outputs differ then an error is indicated, otherwise the output is accepted
as correct. This method requires additional time, so-called time redundancy [16, 22]; however, it
requires no additional software. It is particularly valuable for detecting errors caused by transient 
fault phenomena. If such faults cause an error during only one of the executions then either the 
error will be detected or the output will be correct. 

A variation of the above method uses two separate algorithms, one for each execution, which have
been written independently based on the problem specification. This technique, called N-version
programming [9, 3] (in this case N=2), allows for the detection of errors caused by some faults in
the software in addition to those caused by transient hardware faults and utilizes both time and
software redundancy. Errors caused by software faults are detected whenever the independently
written programs do not generate coincident errors.

Figure 1: Certification trail method.

¹Research partially supported by NSF Grants CCR-8910569 and CCR-8908092.

²Research partially supported by NSF Grant CDA-9015667.

³Research partially supported by NASA Grant NSG 1442.

A significant drawback to the above approaches is the overhead required. Either extra time 
is required to run the algorithms serially on a single processor or extra hardware is required to 
run them in parallel. The technique we will describe is designed to achieve similar types of error 
detection capabilities while reducing the required resource overhead. The central idea, as illustrated 
in Figure 1, is to modify the first algorithm so that it leaves behind a trail of data which we call a 
certification trail. This data is chosen to allow the second algorithm to execute more quickly and/or 
have a simpler structure than the first algorithm. As above, the outputs of the two executions are 
compared and are considered correct only if they agree. Note, however, that we must be careful in 
defining this method or else its error detection capability might be reduced by the introduction of 
data dependency between the two algorithm executions. For example, suppose the first algorithm 
execution contains an error which causes an incorrect output and an incorrect trail of data to be 
generated. Further suppose that no error occurs during the execution of the second algorithm. It 
appears possible that the execution of the second algorithm might use the incorrect trail to generate 
an incorrect output which matches the incorrect output produced by the first algorithm. Intuitively, 
we can regard the two executions as “adversaries.” The second execution must guard against an 
incorrect certification trail “fooling” it into producing an incorrect output. The definitions we give 
below exclude this possibility. They demand that the second execution either generates a correct 
answer or signals the fact that an error has been detected in the certification trail. 

2 Formal Definition of a Certification Trail

In this section we will give a formal definition of a certification trail and discuss some aspects of
its realizations and uses.

Definition 2.1 A problem $P$ is formalized as a relation, i.e., a set of ordered pairs. Let $D$ be the
domain (that is, the set of inputs) of the relation $P$ and let $S$ be the range (that is, the set of
solutions) for the problem. We say an algorithm $A$ solves a problem $P$ iff for all $d \in D$ when $d$ is
input to $A$ then an $s \in S$ is output such that $(d, s) \in P$.

Definition 2.2 Let $P : D \to S$ be a problem. A solution to this problem using a certification
trail consists of two functions $F_1$ and $F_2$ with the following domains and ranges: $F_1 : D \to S \times T$
and $F_2 : D \times T \to S \cup \{\text{error}\}$. $T$ is the set of certification trails. The functions must satisfy the
following two properties:

(1) for all $d \in D$ there exists $s \in S$ and there exists $t \in T$ such that

$F_1(d) = (s, t)$ and $F_2(d, t) = s$ and $(d, s) \in P$

(2) for all $d \in D$ and for all $t \in T$

either ($F_2(d, t) = s$ and $(d, s) \in P$) or $F_2(d, t) = \text{error}$.


We also require that $F_1$ and $F_2$ be implemented so that they map elements not in their respective
domains to the error symbol. The definitions above assure that the error detection capability of
the certification trail approach is comparable to that obtained with the simple time redundancy
approach discussed earlier. (That is, if transient hardware faults occur during only one of the
executions then either an error will be detected or the output will be correct.) It should be further
noted, however, that the examples to be considered will indicate that this approach can also save 
overall execution time. 
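
As an illustration of Definition 2.2, the following sketch instantiates the two functions for sorting. It assumes the trail is the permutation that sorts the input, which is one natural choice and is labeled here as an assumption; the names `first_execution`, `second_execution`, and `ERROR` are hypothetical. The second function takes linear time and never returns an incorrect output: for any trail it either returns the sorted sequence or the error symbol, which is property (2).

```python
ERROR = "error"  # stands in for the error symbol of Definition 2.2


def first_execution(data):
    """F1: sort the input and emit the sorting permutation as the trail."""
    trail = sorted(range(len(data)), key=lambda i: data[i])
    output = [data[i] for i in trail]
    return output, trail


def second_execution(data, trail):
    """F2: rebuild the sorted output in linear time, or report an error."""
    n = len(data)
    if len(trail) != n:
        return ERROR
    seen = [False] * n
    for i in trail:
        # Each entry must be an in-range position that is used exactly once.
        if not isinstance(i, int) or not 0 <= i < n or seen[i]:
            return ERROR
        seen[i] = True
    output = [data[i] for i in trail]
    # A valid permutation that does not put the data in order is rejected.
    if any(output[k] > output[k + 1] for k in range(n - 1)):
        return ERROR
    return output


def certified_sort(data):
    """Run both phases and accept the result only if the outputs agree."""
    output, trail = first_execution(data)
    checked = second_execution(data, trail)
    return output if checked == output else ERROR
```

With a correct trail the second function reproduces the first function's output, as required by property (1); a trail that repeats a position, has the wrong length, or yields an unsorted sequence produces the error symbol.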

The certification trail approach also allows for the detection of faults in software. As in 2- 
version programming, separate teams can write the algorithms for the first and second executions. 
Note that the specification now must include precise information describing the generation and 
use of the certification trail. Because of the additional data available to the second execution, 
the specifications of the two phases can be very different; similarly, the two algorithms used to
implement the phases can be very different. (This will be illustrated in the convex hull example to
be considered later.) Alternatively, the two algorithms can be very similar, differing only in data
structure manipulations. (This will be illustrated in the shortest path example to be considered
later.) When significantly different algorithms are used, the probability that both algorithms will
contain or be affected by faults which generate matching errors should be reduced. When very
similar algorithms are used it is sometimes possible to save programming effort by sharing program
code. For example, the code implementing any data structures needed by the program might be
different, while the code that uses the data structure operations would be the same. This approach
is well suited for the creation of libraries of fault-tolerant data structures. While this reduces the
ability to detect errors in the software it does not change the ability to detect transient hardware 
errors as discussed earlier. Furthermore, in situations like the above example, it is possible (perhaps 
even probable) that the majority of software errors will be in the data structure implementation. 
Thus the ability to detect software errors may not be reduced as much as first imagined. 

Throughout this section we have assumed that our method is implemented with software; 
however, it is clearly possible to implement the method with assistance from dedicated hardware. It 
is also possible to generalize the basic idea we have suggested. We discuss some of these gener- 
alizations in a later section. Finally, we note that a wide variety of approaches to software fault 
tolerance have been proposed and we contrast our method to the most closely related ideas in a 
later section. 

In the following three sections we illustrate the application of certification trails to three 
well-known and significant problems in computer science: the convex hull problem, sorting, and the 
shortest path problem. It should be stressed that the certification trail is not limited to these 
problems. Rather, these algorithms have been selected for illustrative purposes. 





3 Certification Trails for Convex Hulls 

The convex hull problem is a fundamental one in computational geometry. Our certification trail 
solution is based on a solution due to Graham [13] called Graham’s Scan. For basic definitions in 
computational geometry see the text of Preparata and Shamos [20]. This text also illustrates some 
statistical applications of convex hull computations. For simplicity in the following discussion we 
will assume the points are in so-called general position, i.e., no three points are co-linear. It is not 
difficult to remove this restriction. 

Definition 3.1 The convex hull of a set of N points, S, in the Euclidean plane is defined as the 
smallest convex polygon enclosing all the points. This polygon is unique and its vertices are a 
subset of the points in S. It is specified by a counterclockwise sequence of its vertices. 

The algorithm given below constructs the convex hull incrementally in a counterclockwise 
fashion. Sometimes it is necessary for the algorithm to “back up” the construction by throwing some 
vertices out and then continuing. The first step of the algorithm selects the point with minimum 
x-coordinate (using minimum y-coordinate to break ties), and calls it p_1. For each other point q 
in S we compute the slope of the line p_1 q. Sort the points of S (except for p_1) by this slope (since 
the points are in general position, the slopes are distinct). Number these vertices p_2, p_3, ..., p_n. 

It is not hard to show that after these three steps the points, when taken in order p_1, p_2, ..., p_n, 
form a simple polygon, although this polygon might not be convex. It is possible to think of the 
algorithm as removing points from this simple polygon until it becomes convex. The code below 
performs this by “walking” through the vertices in order. The main FOR loop iteration adds points 
to the polygon under construction. After a point is added, the inner WHILE loop checks the angle 
formed by the addition of this point. (Note: We measure angles as follows: Given the three points 
q_{m-1}, q_m, p_k, we measure the angle from q_{m-1} q_m to q_m p_k in the clockwise direction.) If the angle 
is not acute (i.e., it makes the polygon non-convex), then the angle vertex (i.e., the preceding 
point on the polygon) is removed. Note that this will change the preceding angle, which may 
now be obtuse and should be eliminated. The WHILE loop terminates when an acute angle is 
encountered. Figure 2 illustrates the construction of a convex hull using this algorithm, from the 
hull. 

When the main FOR loop is complete the convex hull has been constructed. 

Algorithm CONVEXHULL(S) 

Input: Set of points, S, in R^2 

Output: Counterclockwise sequence of points in R^2 which define the convex hull of S 

1 Let p_1 be the point with the smallest x coordinate (and smallest y to break ties) 
2 For each point p (except p_1) calculate the slope of the line through p_1 and p 
3 Sort the points (except p_1) from the smallest slope to the largest. 
  Call them p_2, ..., p_n 
4 q_1 := p_1; q_2 := p_2; q_3 := p_3; m := 3 
5 FOR k = 4 to n DO 
6     WHILE the angle formed by q_{m-1}, q_m, p_k is > 180 degrees DO 
7         m := m - 1 
8     END WHILE 
9     m := m + 1 
10    q_m := p_k 
11 END FOR 
12 FOR i = 1 to m DO, OUTPUT(q_i) END FOR 

END CONVEXHULL 

Figure 2: Convex hull example. 

First execution: To generate a certification trail for this algorithm, we rely on the property 
that for each point eliminated by the WHILE loop in the code above, we can produce a triangle of 
points in S containing the eliminated point. 

Theorem 3.2 Let p, a, b, and c be points in the plane such that no three are co-linear, p has the 
smallest x-coordinate of the four points (and the smaller y-coordinate if another point has the 
same x-coordinate), and slope(pa) < slope(pb) < slope(pc). If the angle abc is obtuse (measured in the 
clockwise direction), then b is inside the triangle pac. 

Proof: By the ordering of the slopes, b is inside the triangular wedge determined by the rays 
pa and pc. Note that the line segments pa and pc are in the half plane x >= p_x, and in at least one 
the inequality is strict, since no three points are co-linear. This implies that the angle apc (in 
the clockwise direction) must be greater than 180 degrees. Since the angle abc is also obtuse, both 
p and b must be on the same side of line ac. Therefore, b is inside the triangle pac. | 


Corollary 3.3 During execution of CONVEXHULL, if, after adding p_k, the angle formed by 
q_{m-1}, q_m, p_k is obtuse (measured in the clockwise direction), then q_m is contained in the triangle 
p_1, q_{m-1}, p_k. 

Proof: slope(p_1 q_{m-1}) < slope(p_1 q_m) < slope(p_1 p_k). | 




In the first execution the code CONVEXHULL is used. The certification trail is generated by 
adding an output statement within the WHILE loop. Specifically, if an angle greater than 180 
degrees is found in the WHILE loop test then the 4-tuple consisting of q_m and the three surrounding 
points p_k, p_1, q_{m-1} is output to the certification trail. The table below shows the 4-tuples of 
points that would be output by the 
algorithm when run on the example in Figure 2 . The points in the table are given the same names 
as in Figure 2 . The final convex hull points are also output to the certification trail. 

Finally, the trail output does not consist of the actual points in R^2. Instead, it consists of indices 
to the original input data. This means if the original data consists of s_1, s_2, ..., s_n then rather than 
output the element in R^2 corresponding to s_i the number i is output. If point coordinates were 
output instead of these indices, the second execution would have to verify that the points on the 
trail are members of S. 


Point not on convex hull     Three surrounding points 

p_3                          p_4, p_1, p_2 
p_5                          p_8, p_1, p_4 
p_7                          p_8, p_1, p_6 
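
The first execution described above can be sketched in Python as follows. This is only an illustrative sketch, not the authors' code: the function names `cross` and `convex_hull_with_trail` are hypothetical, indices are 0-based rather than 1-based, the input is assumed to contain at least three points in general position, and the "angle greater than 180 degrees" test is expressed as a right turn via a cross product.

```python
def cross(o, a, b):
    """Cross product of vectors o->a and o->b; positive for a left turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull_with_trail(points):
    """Graham's scan over a list of (x, y) points in general position.

    Returns (hull, trail). hull is the counterclockwise list of indices
    into points. trail holds one 4-tuple of indices per eliminated point:
    (eliminated point, p_k, p_1, q_{m-1}).
    """
    n = len(points)
    # Step 1: smallest x coordinate, smallest y to break ties.
    first = min(range(n), key=lambda i: points[i])

    def slope(i):
        dx = points[i][0] - points[first][0]
        dy = points[i][1] - points[first][1]
        # dx >= 0 by the choice of first; a vertical line sorts last.
        return dy / dx if dx != 0 else float("inf")

    # Steps 2 and 3: sort the remaining points by slope.
    order = [first] + sorted((i for i in range(n) if i != first), key=slope)
    hull = order[:3]
    trail = []
    for k in order[3:]:
        # A right turn at the top of the stack is the "> 180 degrees" case.
        while cross(points[hull[-2]], points[hull[-1]], points[k]) < 0:
            removed = hull.pop()
            trail.append((removed, k, first, hull[-1]))
        hull.append(k)
    return hull, trail
```

For example, on the six points (0,0), (4,0), (4,4), (0,4), (2,1), (1,2) the two interior points are eliminated and each is recorded on the trail together with a triangle that contains it.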


Second execution: Let the certification trail consist of a set of 4-tuples, (x_1, a_1, b_1, c_1), (x_2, a_2, b_2, c_2), 
..., (x_r, a_r, b_r, c_r) followed by the supposed convex hull, q_1, q_2, ..., q_m. The code for CONVEXHULL 
is not used in this execution. Indeed, the algorithm performed is dramatically different from 
CONVEXHULL. 

It consists of five checks on the trail data. 

i. That there is a one-to-one correspondence between the input points and the points in 
{x_1, ..., x_r} ∪ {q_1, ..., q_m}. 

ii. That for i ∈ {1, ..., r}, a_i, b_i, and c_i are among the input points. 

iii. That for i ∈ {1, ..., r}, x_i lies within the triangle defined by a_i, b_i, and c_i. 

iv. That for each triple of counterclockwise consecutive points on the supposed convex hull the 
angle formed by the points is acute. 

v. That there is a unique point among the points on the supposed convex hull which is a locally 
maximal point. We say a point q on the hull is a local maximum point if its predecessor in the 
counterclockwise ordering has a strictly smaller y coordinate and its successor in the ordering 
has a smaller or equal y coordinate. 

If any of these checks fail then execution halts and “error” is output. As mentioned above, the 
trail data actually consists of indices into the input data. This does not unduly complicate the 
checks above; in fact it makes it easier to verify the first and second conditions. 
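
The five checks can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: `check_hull`, `cross`, and `in_triangle` are hypothetical names, indices are 0-based, every trail entry is assumed to be a 4-tuple, and "acute" in the sense used above is tested as a strict left turn.

```python
def cross(o, a, b):
    """Cross product of vectors o->a and o->b; positive for a left turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def in_triangle(p, a, b, c):
    """True if p lies strictly inside triangle a, b, c (either orientation)."""
    d1, d2, d3 = cross(a, b, p), cross(b, c, p), cross(c, a, p)
    return (d1 > 0 and d2 > 0 and d3 > 0) or (d1 < 0 and d2 < 0 and d3 < 0)


def check_hull(points, trail, hull):
    """Second execution: return hull if all five checks pass, else "error"."""
    n = len(points)

    def valid(i):
        return isinstance(i, int) and 0 <= i < n

    # Check i: every input index is used exactly once as x_i or q_j.
    seen = [False] * n
    for i in [entry[0] for entry in trail] + list(hull):
        if not valid(i) or seen[i]:
            return "error"
        seen[i] = True
    if not all(seen):
        return "error"

    for x, a, b, c in trail:
        # Check ii: the triangle vertices are input points.
        if not (valid(a) and valid(b) and valid(c)):
            return "error"
        # Check iii: x lies within the triangle a, b, c.
        if not in_triangle(points[x], points[a], points[b], points[c]):
            return "error"

    m = len(hull)
    if m < 3:
        return "error"
    maxima = 0
    for j in range(m):
        prev = points[hull[j - 1]]
        cur = points[hull[j]]
        nxt = points[hull[(j + 1) % m]]
        # Check iv: each consecutive triple must turn left.
        if cross(prev, cur, nxt) <= 0:
            return "error"
        # Check v: count local maxima as defined in the text.
        if prev[1] < cur[1] and nxt[1] <= cur[1]:
            maxima += 1
    return list(hull) if maxima == 1 else "error"
```

Each check touches every trail entry or hull vertex once and does constant work per item, which matches the linear running time claimed for the second execution.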

Time complexity: In the first execution the sorting of the input points takes O(n log(n)) time 
where n is the number of input points. One can show that this cost dominates and the overall 
complexity is O(n log(n)). 

It is possible to implement the second execution so that all five checks are done in O(n) time. 
Because indices into the input data are used, the first condition can be checked by verifying that 
each index is used exactly once, and that all indices are between 1 and N. The second condition 
may be checked simply by verifying that each index is between 1 and N. Checking that a point lies 
within a triangle is a geometric calculation that can be done in constant time. Checking that the 
angle formed by three points is acute requires only constant time. The third and fourth checks can 
be done in O(n) because the certification trail contains indices into the input data as described 
above. The uniqueness of the “local maximum” requires only a constant time calculation at each 
point, so it may be checked in linear time. 

Experimental timing data for this method may be found in Section 6. 


3.1 Proof of correctness 


We wish to prove that the algorithms above constitute a certification trail solution for the convex 
hull problem. Although the definition is phrased in terms of functions, not algorithms, we can 
simply define the functions F_1(d) and F_2(d,t) on particular arguments as the values computed by 
the associated algorithms. 

Using our formal definition of certification trails, let D be the set of all finite planar point sets 
T. Let S be the set of convex polygons, with vertices in counterclockwise order (the restriction to 
counterclockwise ordering makes the convex hull unique). Then the problem we are considering is 
HULL : D → S where HULL(T) is the polygon in S that forms the convex hull of T. 

The description of the algorithms above defines functions F_1 and F_2. We must show that both 
conditions of Definition 2.2 hold. The following two lemmas, which we state without proof, are 
required. 


Lemma 3.4 Let P be a polygon on n points p_1, p_2, ..., p_n. P is a convex polygon iff P is simple 
and each angle p_i p_j p_k is less than or equal to 180 degrees, where i is in 1, 2, ..., n, j = (i + 1) mod n, 
and k = (i + 2) mod n. 


Lemma 3.5 If P is a non-simple polygon, then either P has more than one local maximum, or the 
interior angle at some vertex is greater than 180 degrees. 


Theorem 3.6 F_1(d) and F_2(d,t), as defined above, constitute a certification trail solution for the 
problem HULL. 

Proof: We must prove that both conditions of Definition 2.2 are satisfied by these functions. 

Part 1: Recall that the first condition is: for all d ∈ D there exists s ∈ S and t ∈ T such 
that F_1(d) = (s,t) and F_2(d,t) = s and (d,s) ∈ P. Intuitively, this means that if both executions 
perform correctly, then they will both output the convex hull of the input, which is unique. Note 
that generation of the certification trail does not affect the output of the Graham Scan algorithm. 
Thus the condition on F_1(d) is satisfied by the correctness of the Graham Scan algorithm, the proof 
of which is well known [20]. To show that F_2(d,t) = s, note that a copy of s is contained on the 
trail t. Our description of F_2(d,t) states that s is output unless one of the five checks above fails. 
It is trivial to verify that the first three of these checks must be satisfied. The fourth check cannot 
fail, since the polygon described by s is convex (because (d,s) ∈ P). Similarly, if the fifth check 
fails, then the polygon described by s has two local maxima, and this is not possible for a convex 
polygon. 

Part 2: The second condition is: for all d ∈ D and all t ∈ T either (F_2(d,t) = s and (d,s) ∈ P) or 
F_2(d,t) = error. Intuitively, this means that given an input and arbitrary trail, F_2(d,t) produces a 
solution to the problem or flags an error. Our definition of F_2(d,t) states that the polygon Q stored 
on the trail is output unless one of the five checks fails. We must therefore demonstrate that if all 
five checks succeed, then Q is the convex hull of the input points d. Let H be the convex hull of 
the points d. The first condition guarantees that every point in d is classified as a hull point or an 




^ «v„a. triangle „fi„p„.';™tt cl?“':d*'’T:' ^ 

VNotP ,ha. th JtfrchXdo „« ax “f '^"‘ ™ > ‘"4 ' 

»or do thay eoarlte .1.^. .1- a , ‘I-' P<»sibdiV that interior points are present in O 

TjrreTorrtnVtr^: ly EtS “ - 'rsT' tS:'<5":‘.‘:;ri 

that the ordering of ,he vertices of <J !!ld7 mn^ the”lTf;;.t”'h'”.i?' ''“‘T’ “ “ 

^Jetre-=r^^^^ 


3.2 Other convex hull algorithms 


Vhe h^s“h\rfo“r“ “dtw '^r ' 

oints), containing p. For some convex hull alporitl> *“gle of input points (not necessarily hull 

"Lw be easUy computed when it i« ,1 t ^ ^gonthms, a containing triangle is avaUable directly or 

-this is no, “ “■>• “■ Ho-ver 

_tec„.ion we ntay apl 'to Jr''''’ *' «« 

-L.tpn.is_a poiygL r„„fn.::tr„:o^d“^:dt.“ofi^^^^^^^^ ™ 

~ the convex hull of a set of n Doints w<> Uk i ei. 

i" case' of ^^cTrLl ’ij' 

foT'l'ht iomh,!^ pSs ^ bin^y search. Thus"^ may JeferSe fouT^dng 

is tench steaher 


4 Sorting 

'oll^e"rl.t?l°In?sLg“‘ sif '/'t'r •“ T""'" »««/ 

.operations. We wili see how the certihcatTonTafl arp'J^h r^™W »p'w te M°' 'tf°™s°* 
>Lt a particular sorting algorithm takes as innnt an m / i PP““ *bis problem. Assume 

’treTatl^Lrf 

E :. an .!::: r4Ci'’rerin:7tL:r.n\r • >• 

s*ffot sufficient and we must also verify that th ^ t** T ot<*er. Unfortunately, this 

V^rtihcatio. traii is r^nL^To^lter'chT^^^^^^^^ '"P"‘- 





The information placed on the trail is a permutation relating the input and output arrays. This 
permutation is created by adding an Item Number field to the elements being sorted, such that the 
i-th element is labelled with item number i. After sorting, the permutation is obtained by reading 
the Item Numbers from the elements in their new order. 

The second algorithm reads the permutation from the trail, uses it to rearrange the input elements 
in linear time, and checks that they are now in sorted order. Additionally, it is necessary to check 
that the information on the certification trail actually is a permutation of n elements, i.e., each 
number from 1 to n occurs exactly once. Should any of these checks fail, the second algorithm 
outputs “error”, otherwise it outputs the sorted elements. 
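
The two sorting executions can be sketched in Python as follows. This is a sketch under assumptions rather than the authors' code: `sort_with_trail` and `check_sort` are hypothetical names, item numbers are 0-based, and Python's built-in `sorted` stands in for whichever existing sort routine is being wrapped.

```python
def sort_with_trail(d):
    """First execution: sort d and return (sorted values, permutation trail).

    Each element is tagged with its item number before sorting; the trail
    is read off from the tags in their new order.
    """
    tagged = sorted((value, item) for item, value in enumerate(d))
    s = [value for value, _ in tagged]
    t = [item for _, item in tagged]
    return s, t


def check_sort(d, t):
    """Second execution: rearrange d using trail t in linear time.

    Returns the sorted sequence, or "error" if t is not a permutation
    of the item numbers or the rearranged data is not non-decreasing.
    """
    n = len(d)
    if len(t) < n:
        return "error"
    t = t[:n]  # only the first n trail entries are used
    seen = [False] * n
    for item in t:
        if not isinstance(item, int) or not 0 <= item < n or seen[item]:
            return "error"
        seen[item] = True
    out = [d[item] for item in t]
    for j in range(n - 1):
        if out[j] > out[j + 1]:
            return "error"
    return out
```

Because the tags ride along with the data, the first execution can wrap any sort routine unchanged, and the second execution never compares more than adjacent pairs.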

Note that the certification trail given for sorting is quite different than that given for the convex 
hull problem. In the latter case, the certification trail was constructed for a particular algorithm, 
and the code executing that algorithm modified to produce the trail. In this case, the sorting 
algorithm is not changed. Instead the data being sorted is modified by a preprocessing step, and the 
necessary information extracted by a postprocessing step. Thus this technique may be implemented 
as a wrapper around existing sort routines, no matter which algorithm is implemented. 
Experimental data is presented in Section 6. 

4.1 Proof of correctness 

For concreteness we consider only the sorting of integers, though the proof does not depend on this 
condition. 

Definition 4.1 Let D consist of all finite sequences of integers. Let S consist of all finite 
non-decreasing sequences of integers. Let P : D → S be the sorting problem, i.e., (d,s) ∈ P iff s is a 
permutation of d (by definition of S, s is a non-decreasing sequence). Note that for every d ∈ D, 
there is a unique s ∈ S such that (d,s) ∈ P. Let T consist of finite sequences of integers. For x a 
member of any of the sets D, S, or T, we will also denote the sequence of integers by x_1, x_2, ..., x_N. 

Definition 4.2 The function F_1 : D → S × T is defined as follows. Given an input sequence d 
of N integers, F_1(d) = (s,t) where s is the unique element of S such that (d,s) ∈ P and t is a 
permutation of 1, 2, 3, ..., N s.t. s_i = d_{t_i} for all i = 1, 2, ..., N. Note that unless d consists of N distinct 
integers, there will be more than one possible t. The t produced by F_1(d) may be chosen arbitrarily. 
Since for every d ∈ D, there exists a unique s ∈ S with (d,s) ∈ P, the function F_1 is well defined. 

Definition 4.3 The function F_2 : D × T → S ∪ {error} is defined as follows. F_2(d,t) = d_{t_1}, d_{t_2}, ..., d_{t_N} 
(where d consists of N integers) iff 

i. t contains at least N integers, 

ii. The first N integers of t are a permutation of {1, 2, ..., N}. 

iii. d_{t_i} <= d_{t_{i+1}} for i = 1, 2, ..., N - 1. 

Otherwise, F_2(d,t) = error. Note that though t may contain more than N integers, F_2(d,t) 
depends only on the first N. 

The definitions of the functions F_1 and F_2 correspond to the informal descriptions of the 
sorting algorithms given in the text above. 

Theorem 4.4 F_1 and F_2 are a certification trail solution to the sorting problem P. 




Proof: We must prove that both conditions of Definition 2.2 are satisfied by these functions. 

Part 1: We must prove that for all d ∈ D there exists s ∈ S and t ∈ T such that F_1(d) = (s,t) 
and F_2(d,t) = s and (d,s) ∈ P. If F_1(d) = (s,t), then by definition (d,s) ∈ P. We must show 
that F_2(d,t) = s. t is a permutation of {1, 2, ..., N}, so the first two conditions of Definition 4.3 are 
satisfied. Furthermore, by Definition 4.2, d_{t_i} = s_i for i = 1, 2, ..., N. Since s ∈ S, it is a nondecreasing 
sequence, and thus the third condition of Definition 4.3 is satisfied. Therefore F_2(d,t) = s. 

Part 2: We must show that for all d ∈ D and all t ∈ T either (F_2(d,t) = s and (d,s) ∈ P) 
or F_2(d,t) = error. Pick d ∈ D with length N. Pick t ∈ T. The interesting case is when t is a 
permutation of {1, 2, ..., N}. If not, then either the first N integers of t are not such a permutation, 
in which case F_2(d,t) = error, or t consists of such a permutation followed by more integers. We may 
ignore the latter possibility, since F_2 depends only on the first N integers of t. 

Examine the sequence d_{t_1}, d_{t_2}, ..., d_{t_N}. If there is an i such that d_{t_i} > d_{t_{i+1}} then the third condition 
of Definition 4.3 is violated so F_2(d,t) = error. Otherwise F_2(d,t) = d_{t_1}, d_{t_2}, ..., d_{t_N}. Furthermore, 
this is a non-decreasing sequence, so it must be in S. Finally, since this sequence is a permutation 
of d, (d, F_2(d,t)) ∈ P. 

Therefore, both conditions of Definition 2.2 are satisfied, so F_1 and F_2 constitute a certification 
trail solution to sorting. | 

Note that we defined T as the set of all finite sequences of integers. We could have instead defined 
T as the set of permutations of {1, 2, ..., N} for all positive N. This would make the function F_2 
“simpler”, in that it doesn't have to verify that the certification trail consists of a permutation (it 
would, however, have to verify that it consists of a permutation of the correct size). In this case, 
checking that the trail t is indeed a permutation (i.e., actually in its domain) would be left to the 
implementation of the function. 

5 Certification Trails for Shortest Paths 

This classic problem has been examined extensively in the literature. Our approach is applied to 
a variant of the Dijkstra algorithm [11] as explicated in [10]. First we require some preliminary 
definitions. 

Definition 5.1 A graph G = (V, E) consists of a vertex set V and an edge set E. An edge is an 
unordered pair of distinct vertices which we notate with the following style: [v, w], and we say v is 
adjacent to w. A path in a graph from v_1 to v_k is a sequence of vertices v_1, v_2, ..., v_k such that 
[v_i, v_{i+1}] is an edge for i ∈ {1, ..., k - 1}. Let w be a real function defined on E. The length of a 
path from v_1 to v_k is the sum of w([v_i, v_{i+1}]) for each edge [v_i, v_{i+1}] in the path. 

Let G = (V, E) be a graph and let w be a positive rational valued function defined on E. Given 
a vertex v_1 in V, find a set of shortest paths from v_1 to each other vertex in V. Note that since w 
is positive on all edges, a shortest path must exist between any two vertices, though it need not be 
unique. 

Before we discuss the algorithm we must describe the properties of the principal data structure 
that are required. Since many different data structures can be used to implement the algorithm, we 
initially describe abstractly the data that can be stored by the data structure and the operations 
that can be used to manipulate this data. The data consists of a set of ordered pairs. The first 
element in these ordered pairs is referred to as the item number and the second element is called 
the item value or just value. Ordered pairs may be added and removed from the set; however, at 
all times the item numbers of distinct ordered pairs must be distinct. It is possible, though, for 




distinct ordered pairs to have the same item value. 

Our default convention is that i is an item number, x is a value, and h is a set of ordered 
pairs. A total ordering on the pairs of a set can be defined lexicographically, comparing 
values first and breaking ties by item number. The data structure should support a 
subset of the following operations. 

member(i, h) returns a boolean value of true if h contains an ordered pair with item number i. 
Otherwise returns false. 

insert(i, x, h) adds the ordered pair (i, x) to the set h. 

delete(i, h) deletes the unique ordered pair with item number i from h. 

changekey(i, x, h) is executed only when there is an ordered pair with item number i in h. This 
pair is replaced by (i, x). 

deletemin(h) returns the ordered pair which is smallest according to the total order defined above 
and deletes this pair. If h is the empty set then the token “empty” is returned. 

predecessor(i, h) returns the item number of the ordered pair which immediately precedes the pair 
with item number i in the total order. If there is no predecessor then the token “smallest” is 
returned. 

The above list describes an abstract data type (ADT). There may be several possible 
implementations of it. We will use two ADT implementations: the first implementation will 
produce a certification trail allowing the second to perform ADT operations more quickly. 

The shortest path algorithm appears below. Figure 3 illustrates the execution of the algorithm 
on a sample graph. Table 1 shows the data structure operations performed when the algorithm 
is run on this example. The first column gives the operations, with the parameter h omitted 
to reduce clutter. The second column gives the contents of h after the operation. The 
remaining columns give the pair (if any) removed by a deletemin and the item (if any) output 
to the certification trail. 

The certification trail is generated by modifying the insert(i, x, h) and changekey(i, x, h) 
operations performed during the first execution. The modified operations perform the same 
operations described above and in addition output the following information to the 
certification trail. 

insert(i, x, h) Output the item number of the predecessor of (i, x) (as defined above) to the trail. 
If there is no predecessor, output the token “smallest”. Note that depending on the data 
structure implementation, the predecessor may already be computed during insertion or may 
require a separate call to the predecessor(i, h) operation. 

changekey(i, x, h) Output the item number of the predecessor of the ordered pair (i, x) (i.e., the 
pair resulting from the change) to the trail. If there is no predecessor, output the token 
“smallest” to the trail. 

I «“oZ »‘«ion .0 be 

The algorithm proceeds by maintaining a set S of vertices for which shortest path lengths are 
known, and a frontier set F of vertices adjacent to members of S along with the best known path 





length from v_1. At each step, we find the vertex v in F with smallest known path length and place 
it in S. F is then updated by examining the neighbors of v. New vertices may be added to F or a 
shorter path (passing through v) may be found to existing vertices in F. 

To efficiently find the vertex to add to S, the algorithm uses the data structure operations 
described above. As soon as a vertex v is adjacent to some vertex u in S, it is inserted in the set 
F. The value for v is the shortest known path to v, which is the value of u (shortest path to u) 
plus the weight of edge [u, v]. The array element prefer(v) is used to keep track of this “best” edge 
connecting v to S. As the tree grows, information is updated by operations such as insert(i, x, h) 
and changekey(i, x, h). The deletemin(h) operation is used to select the next vertex to add to the 
span of the current tree. Note, the algorithm does not explicitly store paths. Implicitly, however, 
if (v, x) is returned by deletemin, then prefer(v) indicates the predecessor of v on the shortest path 
from v_1. 

Algorithm SHORTEST-PATH(G, v_1, weight) 

Input: Connected graph G = (V, E) where V = {1, ..., n} with edge weights. 

Output: Lengths of shortest paths from v_1 to all other vertices. 

1 FOR ALL u 6 K, u) := 00 END FOR 

2 vi ) := 0 

3 F 

4 WHILE F ≠ ∅ DO 

5 (v, k) := deletemin(F) 

6 FOR EACH [v, m] ∈ E DO 

7 IF t?) + weight([v, m]) < w) THEN 

8 m) := ») -f weight([u, m]); prefer(m) := » 

9 IF member(m, F) THEN changekey(m, m), F) 

10 ELSE insert(m, m), F) END IF 

11 END IF 

12 END FOR 

13 END WHILE 

14 FOR ALL ueV - {ui}, OUTPUT(u)) END FOR 
END SHORTEST-PATH 

Note that this code may be easily modified to output the shortest paths as well as their lengths. 

First execution: In this execution the SHORTEST-PATH code is used and the abstract data 
type is implemented with a balanced search tree such as an AVL tree [1], a red-black tree [14], or 
a B-tree [5]. In addition, an array indexed from 1 to n is used. Each element of this array contains 
two fields, InSet, a boolean, and Value, storing the same type as the value used in the ordered 
pairs. Initially, InSet is false for all array elements. The balanced search tree stores the ordered 
pairs in h and is based on the total order described earlier. For each item number i, the InSet field 
of the i-th array element is true if and only if there is a pair with item number i in the set. The 
Value field of the i-th array element stores the value of the pair with item number i, if there is one 
in the set. It is undefined if there is no such pair in the set. This array allows rapid execution of 
operations such as member(i, h) and delete(i, h). 
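
A trail-generating implementation of the abstract data type might look like the Python sketch below. Several things here are assumptions and not taken from the report: the class name `TrailSet` is hypothetical, a sorted Python list maintained with `bisect` is swapped in for the balanced search tree (so insertion costs O(n) rather than O(log(n))), and a dictionary plays the role of the InSet/Value array. Only the trail output follows the description of the modified insert and changekey operations.

```python
import bisect


class TrailSet:
    """First-execution set of (item, value) pairs that records a trail."""

    def __init__(self):
        self.pairs = []   # kept sorted by (value, item): the total order
        self.value = {}   # item number -> value, for fast member/delete
        self.trail = []   # certification trail left for the second execution

    def member(self, i):
        return i in self.value

    def insert(self, i, x):
        if i in self.value:
            raise ValueError("item number already in use")
        pos = bisect.bisect_left(self.pairs, (x, i))
        self.pairs.insert(pos, (x, i))
        self.value[i] = x
        # Output the predecessor's item number, or the token "smallest".
        self.trail.append("smallest" if pos == 0 else self.pairs[pos - 1][1])

    def delete(self, i):
        x = self.value.pop(i)
        self.pairs.pop(bisect.bisect_left(self.pairs, (x, i)))

    def changekey(self, i, x):
        # The trail entry is the predecessor of the pair after the change.
        self.delete(i)
        self.insert(i, x)

    def deletemin(self):
        if not self.pairs:
            return "empty"
        x, i = self.pairs.pop(0)
        del self.value[i]
        return (i, x)
```

Replaying the operations of Table 1 against this class yields the trail smallest, 2, 3, 3, 5, smallest, smallest, which is what the second execution consumes.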

Second execution: This execution also uses the SHORTEST-PATH code; however, a different 
data structure is used to implement the ADT. We call this data structure an indexed linked list 
and it is depicted in Figure 5. It consists of an array and a doubly linked list. The array is indexed 
from 0 to n and contains pointers to the elements of the linked list. Except for the first element, 





Operation          Set of Ordered Pairs       Delete    Trail 

insert(2,50)       (2,50)                               smallest 
insert(3,60)       (2,50),(3,60)                        2 
deletemin          (3,60)                     (2,50) 
insert(4,130)      (3,60),(4,130)                       3 
insert(5,62)       (3,60),(5,62),(4,130)                3 
deletemin          (5,62),(4,130)             (3,60) 
changekey(4,103)   (5,62),(4,103)                       5 
deletemin          (4,103)                    (5,62) 
changekey(4,94)    (4,94)                               smallest 
insert(6,72)       (6,72),(4,94)                        smallest 
deletemin          (4,94)                     (6,72) 
deletemin                                     (4,94) 
deletemin                                     empty 

Table 1: Example of operations and trail. 

each element in the list contains a data field storing an ordered pair. The first element stores a 
special ordered pair (0, “smallest”) which is guaranteed to compare less than any other ordered 
pair. The list is maintained in sorted order based on the total ordering defined above for ordered 
pairs. This list represents the contents of the set h. The i-th element of the array points to the node 
containing the ordered pair with item number i, if such an element is present in h. Otherwise the 
pointer is nil. The 0-th element of the array points to the node containing (0, “smallest”). Initially 
all pointers are nil except for the 0-th one. Using an ordered list allows us to perform deletemin(h) 
operations quickly. The array provides rapid random access to the elements. We now describe the 
implementation of the data structure operations. 

insert(i, x, h) Read the next value from the certification trail. This value, call it j, is the item 
number of the ordered pair that will be the predecessor of (i, x) after it is inserted. To 
insert this element, we follow the j-th array pointer to the list node containing the pair (j, y). 
There is one special case: if “smallest” is read from the trail rather than an item number, 
we follow the 0-th pointer. A new node is allocated and inserted into the list just after the 
node containing (j, y). The data field of this node is set to (i, x). Finally, the i-th pointer is 
set to point to the new node. Figure 5 shows the insertion of (5,62) into the data structure 
given that the next item on the certification trail is 3. When the insert(i, x, h) operation is 
performed, some checks must be conducted: 

i. The i-th array element must be nil before the operation is performed. 

ii. The value j read from the trail must either be “smallest” or be between 1 and n, i.e., it 
must be a valid item number. 

iii. The j-th array element must not be nil before the operation is performed. 

iv. The sorted order of the pairs stored in the linked list must be maintained. That is, 
if the j-th pointer points to (j, y) and its successor before the insertion (ignoring the 
special case when (j, y) is the last element of the list) is (j', y'), then we must have 
(j, y) < (i, x) < (j', y'). 

If any of these checks fails, then the execution halts and “error” is output. 

delete(i, h) If the i-th pointer is nil, halt execution and output “error”. Otherwise follow the i-th 
pointer to find the list node containing (i, x). This node is removed from the list. Note that 
since the list is doubly linked, this is a constant time operation. The i-th pointer is then set 
to nil. The only condition that must be checked is that the i-th pointer is not nil before the 
deletion. 

changekey(i, x, h) To perform this operation, it suffices to perform delete(i, h) followed by insert(i, x, h). 
The next item from the certification trail is read when the insert(i, x, h) operation is performed. If 
any of the conditions required by either of these operations fails, then execution halts and 
“error” is output. 

deletemin(h) The 0-th array pointer is traversed to the list head (which contains (0, “smallest”)). 
The pointer to the next node in the list is followed. If there is no next node then “empty” is 
returned. Otherwise, let (i, x) be the pair stored in that node. We remove the node from the 
list, set the i-th array element to nil, and return (i, x). 

member(i, h) The i-th array pointer is examined. “False” is returned if it is nil, otherwise “true” 
is returned. 

predecessor(i, h) This operation is not used during the second execution of SHORTEST-PATH, 
but is described for completeness. Follow the i-th pointer to the node containing the pair 
(i, x). Follow the pointer from that node to the node preceding it on the list (note that this 
node will always exist). If this is the special node (0, “smallest”), return “smallest”, otherwise 
return the item number of the pair stored in this node. 
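
The indexed linked list can be sketched in Python as below. This is an illustration under assumptions, not the authors' implementation: the names `IndexedLinkedList`, `Node`, and `TrailError` are hypothetical, raising `TrailError` stands in for "execution halts and error is output", and the total order is taken to compare values first and item numbers second. The predecessor operation is omitted because the second execution does not use it.

```python
class TrailError(Exception):
    """Stands in for halting the execution and outputting "error"."""


class Node:
    def __init__(self, pair, prev=None, nxt=None):
        self.pair, self.prev, self.next = pair, prev, nxt


class IndexedLinkedList:
    """Second-execution set: constant-time operations guided by the trail."""

    def __init__(self, n, trail):
        self.n = n
        self.trail = iter(trail)
        self.head = Node((0, "smallest"))  # sentinel, less than any pair
        self.ptr = [None] * (n + 1)        # ptr[i] -> node holding item i
        self.ptr[0] = self.head

    def member(self, i):
        return self.ptr[i] is not None

    def insert(self, i, x):
        if not 1 <= i <= self.n or self.ptr[i] is not None:
            raise TrailError("check i failed")
        j = next(self.trail, None)
        if j == "smallest":
            j = 0
        elif not (isinstance(j, int) and 1 <= j <= self.n):
            raise TrailError("check ii failed")
        before = self.ptr[j]
        if before is None:
            raise TrailError("check iii failed")
        after = before.next
        key = (x, i)
        # Check iv: (j, y) < (i, x) < (j', y') in the total order.
        if before is not self.head and not (before.pair[1], before.pair[0]) < key:
            raise TrailError("check iv failed")
        if after is not None and not key < (after.pair[1], after.pair[0]):
            raise TrailError("check iv failed")
        node = Node((i, x), before, after)
        before.next = node
        if after is not None:
            after.prev = node
        self.ptr[i] = node

    def delete(self, i):
        if not 1 <= i <= self.n or self.ptr[i] is None:
            raise TrailError("delete of absent item")
        node = self.ptr[i]
        node.prev.next = node.next
        if node.next is not None:
            node.next.prev = node.prev
        self.ptr[i] = None

    def changekey(self, i, x):
        self.delete(i)
        self.insert(i, x)

    def deletemin(self):
        node = self.head.next
        if node is None:
            return "empty"
        i, x = node.pair
        self.delete(i)
        return (i, x)
```

No operation searches the list: the trail tells insert where the new node belongs, and the order check on its two neighbours is what protects against a corrupted trail.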

There are two variations to this scheme that are worth noting. First, we could implement a 
singly linked list rather than a doubly linked list. This eliminates the overhead of maintaining the 
extra pointer. Note, however, that operations such as delete(i, h) require access to predecessors in 
order to update the list quickly. This can be provided by modifying the operations delete(i, h), 
changekey(i, x, h), and predecessor(i, h) so that they output predecessor information to the trail. 

The other variation also uses a singly linked list but removes the need for extra certification trail 
information for delete(i, h) and changekey(i, x, h) operations. It uses the technique of marking a 
list node for deletion rather than removing it from the list immediately (the appropriate 
pointer in the array is still set to nil immediately). When performing other operations, we check 
for and remove any marked nodes immediately following nodes visited. The total running time is 
still linear, though insert operations are no longer constant time operations. 

Time complexity: In the first execution each data structure operation can be performed in 
$O(\log n)$ time where $|V| = n$. There are at most $O(m)$ such operations and $O(m)$ additional time 
overhead where $|E| = m$. Thus, the first execution can be performed in $O(m \log n)$ time. In addition, 
it provides us with a relatively simple and illustrative example of the use of a certification trail. 

In the second execution each data structure operation can be performed in $O(1)$ time. There are still 
at most $O(m)$ such operations and $O(m)$ additional time overhead. Hence, the second execution 
can be performed in $O(m)$ time, i.e., linear time. 

Section 6 contains results of timing experiments with this technique. 



5.1 Proof of correctness 

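We now show that the two executions of SHORTEST-PATH constitute a certification trail solution to the 
shortest path problem, i.e., that they satisfy Definition 2.2. We first consider these algorithms 
as solutions to the problem of evaluating a sequence of the above data structure operations. 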

Let $D$ be the set of finite sequences of the data structure operations defined above. Let $S$ be 
the set of finite sequences of answers to these operations. Let $P$ be the relation consisting of 
the pairs $(d, s)$ where $d \in D$ and $s \in S$, and $s$ is the sequence of answers resulting from 
executing the operations $d$ starting with the empty set. 

Note that we are examining all finite sequences of data structure operations, not just “legal” 
ones. It is possible, for example, to attempt to perform an insertion with an item number already 
in use. Recall that if one of these “illegal” operations is attempted, the operation will output 
“error” and terminate processing. Thus, we can define the answer sequences for these “illegal” 
sequences. 

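Let $F_1(d)$ be the result of executing the operations on any of the standard data structures 
described above, with the insert(i, x, h) and changekey(i, x, h) operations modified to produce 
trail output. Let $F_2(d, t)$ be the result of executing the operations using the indexed linked 
list implementation described above. 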

Theorem 5.4 $F_1(d)$ and $F_2(d, t)$ constitute a certification trail solution for $P$. 

Proof: We show that the two conditions of Definition 2.2 are satisfied by these functions. 

The first condition is: for all $d \in D$ there exists $s \in S$ and there exists $t \in T$ such that 
$F_1(d) = (s, t)$ and $F_2(d, t) = s$ and $(d, s) \in P$. Let $(s, t) = F_1(d)$. The modifications 
of the data structure operations that produce trail output do not affect how the data structures 
operate, so we may assume $(d, s) \in P$. We must demonstrate that $F_2(d, t) = s$. 

It suffices to show that, after any operation that modifies the set h, the elements stored in 
the indexed linked list correspond to the elements in the set h [...]. 

To demonstrate this, we show that each operation maintains the following invariants. 

i. If there is a pair (i, x) in h, then the i-th pointer points to the list node containing (i, x). 

ii. If, for some i, there is no pair in h with item number i then the i-th pointer is nil. 

iii. The list nodes are in ascending order. 

iv. Each list node [...] (this implies that it is pointed to by exactly one pointer from the array). 

The first two conditions assert that the indexed linked list and the set h [...]. 




Clearly each of these conditions is true before the first operation is performed (the set of pairs 
is empty, all pointers except the 0-th are nil, and (0, “smallest”) is the only list node). 

Assume that the above conditions are satisfied after the first k operations, and that the output 
generated by any of the first k operations is correct. We claim that the invariants will remain 
satisfied after the (k+1)-st operation, and that if the (k+1)-st operation generates output, it will be 
correct. Let s(k+1) denote the output produced by the (k+1)-st operation (where $F_1(d) = (s, t)$). 
Consider each possible operation. For brevity, we omit details for “illegal” operations, i.e., those 
that violate the precondition of the operation. Similarly, we omit details of the special case of 
“smallest” being read from the trail. 

insert(i, x, h) The trail t contains the item number j of the predecessor of (i, x). Call the predecessor 
(j, y). By assumption, the i-th pointer is nil before the insert. If not, this operation outputs 
“error” and execution halts. Since the indexed linked list correctly represents h at this point, 
this agrees with the result returned by $F_1(d)$, i.e., s(k+1) = “error”. After the insertion is 
performed, the i-th pointer is set to the new node containing (i, x), so the first condition is 
satisfied. No other nodes are added to the list, so the second condition will remain true. The 
third condition is satisfied since (j, y) is now the immediate predecessor of (i, x). Since no 
other pointer in the array has been changed, the fourth condition is still true. 

delete(i, h) This operation sets the i-th pointer to nil, and removes the node containing (i, x) 
from the list. This satisfies the second invariant. Deleting a node cannot violate the third 
invariant. Since no other nodes are removed and no other pointers are changed, the first and 
fourth invariants remain satisfied. 

deletemin(h) By assumption, the nodes are currently in ascending order. Thus, the minimum 
element in h must correspond to the node following the special list head node; call the pair it 
contains (i, x). This pair is the correct output for this operation. As with delete, the above 
four conditions remain true after this node is removed and the i-th pointer set to nil. 

changekey(i, x, h) We have implemented changekey(i, x, h) as an insertion followed by a deletion. 
Since both of those preserve the invariants, changekey(i, x, h) must do so as well. 

member(i, h) By assumption, the indexed linked list correctly represents h before this operation, 
so the output of this operation will be correct. Since this operation does not change the set 
or the indexed linked list, the invariants remain satisfied. 

predecessor(i, h) By assumption, the indexed linked list correctly represents h, and furthermore it is 
currently in sorted order. Thus, the list element preceding the node containing (i, x) is the 
predecessor. Since this operation changes neither h nor the indexed linked list, the invariants 
remain satisfied. 


This demonstrates that the first condition of Definition 2.2 is satisfied. 

The second condition is: for all $d \in D$ and for all $t \in T$, either ($F_2(d, t) = s$ and 
$(d, s) \in P$) or $F_2(d, t) = \text{error}$. Intuitively, this states that if $F_2(d, t)$ is passed 
an arbitrary trail it either outputs a correct answer, or it outputs “error”. We prove an even 
stronger condition. Let $t_{\mathrm{correct}}$ be the trail returned by $F_1(d)$, i.e., 
$F_1(d) = (s, t_{\mathrm{correct}})$. Then either $t_{\mathrm{correct}}$ is a prefix of $t$, 
or $F_2(d, t) = \text{error}$. 

If $t_{\mathrm{correct}}$ is a prefix of $t$, then we are done. The algorithm describing $F_2(d, t)$ 
does not examine any part of the trail after $t_{\mathrm{correct}}$, so $F_2(d, t) = s$. 




If $t_{\mathrm{correct}}$ is not a prefix of $t$, let p be the position at which they first differ. Let O be the number 
of the operation that uses the trail data at p. Then operation O is either an insert(i, x, h) or 
changekey(i, x, h) operation. If it is an insert operation, then $t_{\mathrm{correct}}$ contains the item number of 
the predecessor of (i, x). Since $t$ contains a different value, call it j, at this location, the insert(i, x, h) 
operation will fail one of its three checks. Either j will not be a valid item number, or the j-th 
pointer will be nil, or the pair (j, y) will not be the predecessor of (i, x). The argument for the 
changekey(i, x, h) operation is essentially the same. 

Thus, the second condition is satisfied. 

Therefore, $F_1(d)$ and $F_2(d, t)$ are a certification trail solution to $P$, the problem of evaluating 
data structure operations. □ 
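
To illustrate the two conditions just proved, the following Python sketch pairs a first execution that produces a trail with a second execution that consumes it. It is a simplified illustration, not the authors' code: only insert and deletemin are modelled, the function names and the tuple encoding of operations are hypothetical, pairs are assumed to be ordered by key with ties broken by item number, and the first execution uses a sorted Python list purely for brevity (so it does not achieve the logarithmic bound of a balanced tree).

```python
import bisect


def first_execution(ops):
    """Evaluate ops on a sorted list; return (answers, trail of predecessor item numbers)."""
    pairs, used = [], set()          # pairs holds (key, item) tuples in ascending order
    answers, trail = [], []
    for op in ops:
        if op[0] == "insert":
            _, i, x = op
            if i in used:            # "illegal" operation: item number already in use
                return "error", trail
            pos = bisect.bisect_left(pairs, (x, i))
            trail.append(pairs[pos - 1][1] if pos > 0 else 0)   # 0 stands for "smallest"
            pairs.insert(pos, (x, i))
            used.add(i)
        elif op[0] == "deletemin":
            if pairs:
                x, i = pairs.pop(0)
                used.discard(i)
                answers.append((i, x))
            else:
                answers.append("empty")
    return answers, trail


def second_execution(ops, trail):
    """Re-evaluate ops with a linked list, trusting no trail entry until it is checked."""
    nxt = {0: None}                  # item -> next item; item 0 is the list head
    key = {}
    answers = []
    entries = iter(trail)
    for op in ops:
        if op[0] == "insert":
            _, i, x = op
            j = next(entries, None)  # None if the trail is too short
            if i in nxt or j not in nxt:
                return "error"
            k = nxt[j]
            pred_ok = j == 0 or (key[j], j) < (x, i)
            succ_ok = k is None or (x, i) < (key[k], k)
            if not (pred_ok and succ_ok):
                return "error"
            nxt[j], nxt[i], key[i] = i, k, x
        elif op[0] == "deletemin":
            i = nxt[0]
            if i is None:
                answers.append("empty")
            else:
                answers.append((i, key[i]))
                nxt[0] = nxt.pop(i)
                del key[i]
    return answers
```

With a correct trail, or any trail that has the correct trail as a prefix, the second execution reproduces the answers of the first; with a trail that differs at a position that is actually read, the checks in the insert branch fail and “error” is returned.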

Definition 5.5 Let $D$ be the set of finite graphs $G = (V, E)$ with edge weights consisting of positive 
integers. Assume the vertices are numbered 1 through n. Let $S$ be the set of finite ordered tuples 
of positive integers. Let $P$ be the relation that associates each graph with the tuple consisting of 
the minimum path lengths to each vertex. Let $SP_1(d)$ be the function defined by the SHORTEST-PATH 
algorithm with the data structure defined for the first execution. Let $SP_2(d, t)$ be the function 
defined by the SHORTEST-PATH algorithm using the indexed linked list implementation. 

Corollary 5.6 $SP_1(d)$ and $SP_2(d, t)$ constitute a certification trail solution for $P$. 

Proof: If $SP_1(d) = (s, t)$, then the correctness of Dijkstra's algorithm implies that $(d, s) \in P$. 
The algorithms that compute $SP_1(d)$ and $SP_2(d, t)$ are the same except for data structure 
implementation. Theorem 5.4 implies that if these algorithms generate the same data structure 
operations, then the same sequence of answers will be generated. Thus, to demonstrate that 
$SP_2(d, t) = s$, it must be shown that the same sequence of data structure operations is generated 
by both algorithms. Examination of SHORTEST-PATH indicates that the k-th data structure 
operation to be performed is dependent only on the input and the result of previous data structure 
operations. For example, at line 9, either an insert(i, x, h) or a changekey(i, x, h) is performed, 
depending on the result of a member(i, h) operation. The input graph d is identical for both 
algorithms, thus the first data structure operation performed must be the same. Assume that the 
first k operations performed by both algorithms are identical. Then, by Theorem 5.4, the answers 
to those operations will be the same. Since the (k+1)-st operation depends only on the input and 
the results of the previous k operations, it must also be the same for both algorithms. Therefore 
the same sequence of data operations is performed in both algorithms, so $SP_2(d, t) = s$. 

The proof that the second condition holds is the same as for Theorem 5.4. Either the input trail 
t contains the “correct” trail as a prefix, or one of the data structure operations will fail, resulting 
in an “error” output. □ 

One point has been glossed over in the above proof. In the SHORTEST-PATH algorithm, results 
of deletemin(h) are not output nor are they stored in the certification trail. It might be possible for 
incorrect answers to be returned by deletemin(h) operations while still producing correct shortest 
paths and lengths. The second execution of the SHORTEST-PATH algorithm will not detect this 
since the correct output is produced. By proving that the answers to deletemin(h) operations are 
the same, we have proven more than strictly required. 

6 Experimental Data on Certification Trails 

We have performed extensive timing experiments on several basic and well-known problems, including 
the ones described in this paper. Algorithms for solving these problems were implemented, both 
with and without the use of certification trails. Timing data was collected on both the certification 
trail solutions and the basic solutions. The following tables summarize these results. 


Size      Basic       First Execution          Second Execution   Speedup   Percent
          Algorithm   (Also Generates Trail)   (Uses Trail)                 Savings

5000      0.61        0.62                     [...]              8.73      43.62
10000     1.33        1.34                     0.14               9.56      44.54
25000     3.68        3.68                     [...]              10.22     [...]
50000     7.68        7.74                     0.71               10.75     [...]
100000    16.23       16.30                    1.43               [...]     45.39
200000    33.93       34.37                    2.84               [...]     45.16

Table 2: Convex Hull 


Size      Basic       First Execution          Second Execution   Speedup   Percent
          Algorithm   (Also Generates Trail)   (Uses Trail)                 Savings

10000     0.28        0.30                     0.04               7.00      39.29
50000     1.80        1.90                     0.19               9.47      [...]
100000    3.96        4.08                     0.41               9.66      [...]
500000    23.95       24.69                    2.14               11.19     [...]
1000000   50.23       51.57                    4.38               11.47     44.31

Table 3: Sort 


Size         Basic       First Execution          Second Execution   Speedup   Percent
             Algorithm   (Also Generates Trail)   (Uses Trail)                 Savings

100,1000     [...]       0.05                     0.02               2.00      [...]
[...]        [...]       0.16                     0.06               2.50      [...]
500,5000     [...]       0.33                     0.11               2.82      29.03
1000,10000   [...]       0.76                     0.23               3.04      29.29
2000,20000   1.58        1.67                     0.45               3.51      32.91
2500,25000   2.06        2.15                     0.55               3.75      34.47

Table 4: Shortest Path 


The timing information was gathered on a Sun SPARCstation ELC with 16MB of RAM. The 
system was run as a standalone machine in single user mode during timing experiments. 

Much of the data presented in the timing tables is essentially self-explanatory relative to the 
certification trail technique and algorithms considered. However, a brief discussion of the table 
entries is appropriate. 

The column labelled Basic Algorithm contains timing data which gives the execution time of the 
algorithm in producing the output without the generation of the certification trail. All timing data 
is listed in seconds. 



The First Execution column gives the execution time of the algorithm in producing the output 
with the additional overhead of generating the certification trail. 

The Second Execution column gives the execution time of the algorithm in producing the output 
while using the certification trail. 

The Speedup column is the ratio of the run times of the Basic Algorithm and the Second 
Execution. One reason this figure is important is that it is possible for the two algorithms to run in 
different environments (different hardware, programming language, etc). A high speedup indicates 
that less powerful hardware or a higher level language (with associated overhead) may be sufficient 
for the second execution. 

The Percent Savings column records the percentage of execution time saved by using the 
certification trail method as compared to a 2-version programming approach. The time 
required for a 2-version programming approach was estimated by doubling the time reported in the 
Basic Algorithm column. This assumes that both versions take approximately the same amount of 
time to execute. 
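
The two derived columns can be recomputed from the three timing columns. The short Python sketch below does so under one stated assumption: that Percent Savings is the difference between the estimated 2-version time (twice the Basic Algorithm time) and the total certification trail time (First Execution plus Second Execution), expressed as a percentage of the 2-version time. The function names are illustrative only.

```python
def speedup(basic, second):
    """Ratio of the Basic Algorithm time to the Second Execution time."""
    return basic / second


def percent_savings(basic, first, second):
    """Savings of the certification trail method relative to 2-version programming."""
    two_version = 2.0 * basic          # estimate: both versions take the basic time
    trail_total = first + second       # first execution plus second execution
    return 100.0 * (two_version - trail_total) / two_version


# Last row of Table 3 (Sort): 1000000 items.
print(round(speedup(50.23, 4.38), 2), round(percent_savings(50.23, 51.57, 4.38), 2))
# prints: 11.47 44.31
```

For this row the recomputed values match the printed Speedup and Percent Savings entries. For rows with very small times the recomputed figures can differ slightly from the table, presumably because the table entries were derived from unrounded measurements.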

In addition to the tables, the timing information for the convex hull algorithm is plotted in 
Figure 5. Plots for the other two examples are similar. 

Examination of the data collected for the convex hull algorithm indicates that: 

• The overhead in generating the certification trail is very small, less than 2% of the running 
time of the basic (no certification trail) algorithm. 

• The second execution is very fast, achieving an order of magnitude speedup for larger input 
sizes. This suggests that a single “second algorithm” process could easily handle the output 
generated by several “first algorithm” processes running in parallel. Alternatively, the high 
speedup would allow the second execution to be run on lower performance (and hence less 
expensive) hardware. Finally, the large speedup and reduced code complexity may make it 
possible to take advantage of a formally verifiable language (which may require significant 
overhead) in implementing the second algorithm. 

The data for sorting indicates that the certification trail also requires very low overhead and 
results in a large speedup. For the shortest path problem the overhead is still very low, and the 
speedup, while not as dramatic as for the first two problems, is still quite respectable. 

7 Comparison With Other Techniques 

The certification trail approach shares similarities with other valuable fault tolerance and fault 
detection techniques that have been previously proposed and examined, but in each case there are 
significant and fundamental distinctions. These distinctions are primarily related to the generation 
and character of the certification trail and the manner in which the secondary algorithm uses the 
certification trail. 

First consider the important and useful technique called N-version programming [9, 3]. When 
using this technique N different implementations of an algorithm are independently executed with 
subsequent comparison of the resulting N outputs. There is no relationship among the executions of 
the different versions of the algorithms other than that they all use the same input; each algorithm 
is executed independently without any information about the execution of the other algorithms. In 
marked contrast, the certification trail approach allows the primary algorithm to generate a trail 
of information which can be read by the secondary algorithm. The advantages of utilizing this 
additional information are shown in the body of this paper. In effect, N-version programming can 
be thought of relative to the certification trail approach as the employment of a null trail. 


[Figure 5: Convex Hull Run Times. Horizontal axis: Number of Input Points (Thousands).] 



[...] to combine the second execution and the comparison. It is clearly possible [...] program which 
certifies the correctness of the output of the first execution. 

[...] the algorithm-based fault tolerance approach [...]. The certification trail approach [...] the 
fundamental operations of the algorithm [...] generated by the algorithm. As seen in Section 6, [...]. 

[...] This work has been followed by a burst of activity in this general area [...]. An example of a 
program checker is the algorithm developed [...]. 

The Blum-Kannan program checking method differs from the certification trail method in two 
important ways. First, [...]. That is, the checker is designed based on the input/output specification 
of a problem and no assumptions are made about the method [...]. The program which is being checked 
is treated as a black box. It cannot be altered nor can its internal status be examined and exploited. 
In the certification trail method the algorithm being checked is not treated as a black box. Instead 
the algorithm can be modified to output additional information (i.e., the certification trail) which 
is considered to be [...] the checking/verification process. By exploiting this capability it is 
sometimes possible to [...] trail solutions which allow faster checking than Blum-Kannan program 
checkers. Of course, these faster solutions are more specialized than the Blum-Kannan [...]. 

[...] times. In the certification trail approach the program is run only once. Thus, the overall 
time of a checking process can be significantly larger for Blum-Kannan checkers. 

A third less important difference stems from the fact that Blum-Kannan checkers are defined 
in a more general probabilistic context. Certification trails are currently defined only for 
deterministic checkers. However, it is clearly possible to define them in the more general 
probabilistic context. 

Other work has been done to extend the ideas of Blum-Kannan to give methods which allow 
the conversion of some programs into new programs which are self-testing and self-correcting [12]. 
These methods are also based on treating programs as black boxes and thus have 
limitations similar to Blum-Kannan program checkers. A recent paper by Blum et al. [8] concerns 
checking the correctness of memories and data structures. The results described in that paper 
differ from our results on abstract data types in one central way. The checkers that they design 
are restricted in memory usage. Typically, they use only $O(\log n)$ storage to check data 
structures. Our results do not place space constraints on the algorithm used to certify 
the data structure. Without a space constraint we are able to certify abstract data types that are 
more complex than the data structures that they check, i.e., stacks and 
queues. Also, we are able to achieve a speedup in the checking process and they are not. 

Babai, Fortnow, Levin and Szegedy [4] present methods which appear to allow remarkably fast 
checking, in polylogarithmic time. Their approach has some similarities to the methods we 
propose: they modify algorithms to yield new algorithms which output additional 
information. We refer to this additional information as a certification trail and they refer to this 
information as [...]. However, we are interested in modified algorithms which have the same 
asymptotic time complexity as the original algorithm. Indeed, the modified algorithm should be 
slowed down by at most a fixed multiplicative factor. In their approach the algorithm is slowed 
down by more than any fixed multiplicative factor. Specifically, if the original algorithm has a 
time complexity of $O(T)$ then the modified algorithm has a time complexity of $O(T^{1+\epsilon})$. 
Note that in practice the $\epsilon$ cannot be too small because its inverse appears in the exponent 
of the checker time complexity. Another disadvantage of their methods is the fact that their method 
requires that the input and output be encoded using an error-correcting code. The encoding process 
takes [...] time for strings of length N. However, many of the checkers we have developed take only 
linear time so the cost of simply preparing to use their method appears to be too great in some 
cases. It is also necessary [...] 


8 Generalization and Future Research Areas 

Our experience with certification trails indicates that this technique is of great practical 
value as well as of theoretical interest. Furthermore, the techniques illustrated, such as the 
certification of abstract data types described in the shortest path example, are applicable to a 
wide range of problems. Many areas remain for further exploration, a few of which are described 
below. 


8.1 Certified Data Structure Libraries 

It is apparent that the certification trail technique described for the SHORTEST-PATH program 
is not limited to that program. Since the certification trail is produced and used by the abstract 
data type operations, the technique applies to any algorithm that can be implemented in 
terms of those abstract data types. Creating a library of such “certified data types” enables 
programmers to create fault tolerant programs without having to be familiar with the certification 
trail technique. Object oriented programming appears to be well suited to this task. 

A possible objection to this is that it provides fault detection only for the data structure 
implementation, since the surrounding code is simply reused. Furthermore, the data structure 
implementation is likely to come from library code, and hence be highly reliable. In answer to this 
note that: 

• In many algorithms, the code using the data structure is much simpler than the code 
implementing the data structure. 

• Although the example above illustrated reuse of the code using the data structures, it is certainly 
possible for this code to be developed separately for the first and second execution programs. 

• Errors are often found even in code that has been in use for a long period of time. The added 
confidence of using this technique may be desirable even for library code. 

• Even if the library code is highly reliable, the certification trail can be helpful in detecting 
errors caused by hardware problems. 

• Library code may have to be tuned or even rewritten to meet the needs of a particular application 
or environment, partially negating the claim of using well-tested code. 

Even if fault detection is not an issue, the certification trail technique is useful during program 
testing and debugging. Input may be automatically generated and processed. If the output of the 
first and second executions differ or an error is otherwise flagged, the input set is flagged. This 
reduces the need to otherwise compute output for selected input and enables both more and larger 
sets of input to be processed. 2-version programming may be used during debugging in a similar 
manner; however, certification trails have the advantage of reduced overhead, allowing more test 
cases to be run, a reduction in the hardware required for testing, or both. 


8.2 Almost-concurrent execution of the certification trail 

In the above discussion and examples, the certification trail programs have been executed serially, 
i.e., we do not run the second execution until after the first execution has completed. Actually, 
except for sorting, the two executions in the examples above can be run almost-concurrently. The 
“second” execution simply reads the information from the certification trail as it becomes available. 
The two programs will finish nearly simultaneously, the difference being in the time after the last 
element is read from or written to the certification trail. 
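
The idea can be sketched with a producer thread and a queue. The example below is an illustrative toy rather than one of the programs from the experiments: the lookup problem, the function names, and the use of Python's threading and queue modules are all assumptions. The first execution answers membership queries on a sorted list by binary search and writes each search position to the trail as soon as it is known; the second execution reads each position as it becomes available and checks it in constant time before using it.

```python
import bisect
import queue
import threading

_END = object()  # written after the last trail entry


def first_lookup(arr, queries, emit):
    """Answer membership queries by binary search, emitting each position to the trail."""
    answers = []
    for q in queries:
        pos = bisect.bisect_left(arr, q)
        emit(pos)                      # the entry is available to the reader immediately
        answers.append(pos < len(arr) and arr[pos] == q)
    return answers


def second_lookup(arr, queries, read):
    """Re-answer the queries in constant time each, checking every trail entry first."""
    answers = []
    for q in queries:
        pos = read()                   # blocks until the first execution has written it
        ok = (isinstance(pos, int) and 0 <= pos <= len(arr)
              and (pos == 0 or arr[pos - 1] < q)
              and (pos == len(arr) or arr[pos] >= q))
        if not ok:
            return "error"
        answers.append(pos < len(arr) and arr[pos] == q)
    return answers


def run_almost_concurrently(arr, queries):
    trail = queue.Queue()
    out = {}

    def producer():
        try:
            out["first"] = first_lookup(arr, queries, trail.put)
        finally:
            trail.put(_END)            # a premature end of trail is rejected by the checker

    worker = threading.Thread(target=producer)
    worker.start()
    out["second"] = second_lookup(arr, queries, trail.get)
    worker.join()
    return out.get("first"), out["second"]
```

Because the reader blocks only when the trail is momentarily empty, the two executions overlap almost entirely, and the results can be compared as soon as both return.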

8.3 Continuing after an error 

A possible extension to the use of certification trails is to attempt to continue the second execution 
after an error is detected. Consider the shortest path example using abstract data types. In 
that example, the second execution used an indexed linked list that performed each operation in 
constant time by using the certification trail from the first execution. Suppose that an error had 
been detected during the second execution. Rather than simply aborting, it may be possible to 
continue execution. This could be done by 

• Reorganizing the existing set into some other data structure (such as an AVL tree, red-black 
tree, etc.) that allows efficient operation without a certification trail. 

• Continuing to use the indexed linked list and ignoring the rest of the certification trail. Note 
that this would result in some operations requiring more time. 

• Continuing to use the indexed linked list and attempting to use the certification trail for future 
operations. This may be possible if the error that occurred has sufficiently “local” effect. For 
example, if part of a tree structure is corrupted during the first execution, it is still possible 
that operations involving other parts of the tree will be performed correctly. 

On a related topic, research has been done on “self-correcting” data structures in which enough 
redundancy is built into a data structure so that it may be reconstructed if part of it is corrupted. 
Using certification trails with such structures could provide an efficient detector for corruption of 
the data structure. 


References 

[1] Adel'son-Vel'skii, G. M., and Landis, E. M., “An algorithm for the organization of information”, 
Soviet Math. Dokl., pp. 1259-1262, 3, 1962. 

[2] Anderson, T., and Lee, P., Fault tolerance: principles and practices, Prentice-Hall, Englewood 
Cliffs, NJ, 1981. 

[3] Avizienis, A., “The N-version approach to fault tolerant software,” IEEE Trans. on Software 
Engineering, vol. 11, pp. 1491-1501, Dec., 1985. 

[4] Babai, L., Fortnow, L., Levin, L., and Szegedy, M., “Checking computations in polylogarithmic 
time,” Proceedings of the 23rd ACM Symposium on Theory of Computing, pp. 21-31, 1991. 

[5] Bayer, R., and McCreight, E., “Organization of large ordered indexes”, Acta Inform., pp. 
173-189, 1, 1972. 

[6] Blum, M., and Kannan, S., “Designing programs that check their work”, Proceedings of the 
1989 ACM Symposium on Theory of Computing, pp. 86-97, ACM Press, 1989. 

[7] Blum, M., Luby, M., and Rubinfeld, R., “Self-testing/correcting with applications to numerical 
problems,” Proceedings of the 22nd ACM Symposium on Theory of Computing, pp. 73-83, 1990. 

[8] Blum, M., Evans, W., Gemmell, P., Kannan, S., and Naor, M., “Checking the correctness of 
memories,” Proceedings of the 32nd IEEE Symposium on Foundations of Computer Science, 
pp. 90-99, 1991. 

[9] Chen, L., and Avizienis, A., “N-version programming: a fault tolerant approach to reliability of 
software operation,” Digest of the 1978 Fault Tolerant Computing Symposium, pp. 3-9, IEEE 
Computer Society Press, 1978. 

[10] Cormen, T. H., Leiserson, C. E., and Rivest, R. L., Introduction to Algorithms, McGraw-Hill, 
New York, NY, 1990. 

[11] Dijkstra, E. W., “A note on two problems in connexion with graphs,” Numer. Math. 1, pp. 
269-271, Sept., 1959. 

[12] Gemmell, P., Lipton, R., Rubinfeld, R., Sudan, M., and Wigderson, A., “Self-testing/correcting 
for polynomials and for approximate functions,” Proceedings of the 23rd 
ACM Symposium on Theory of Computing, pp. 32-42, 1991. 


Guibas, L. J., and Sedgewick, R., “A dichromatic framework for balanced trees”, Proceedings 
of the Nineteenth Annual Symposium on Foundations of Computing, pp. 8-21, IEEE Computer 
Society Press, 1978. 

[...] systems, Addison-Wesley, Reading, MA, 1989. 

[...] Symposium, pp. 338-343, IEEE Computer Society Press, June, 1985. 

[...] of a fault-tolerant multiprocessor using recovery blocks,” IEEE Trans. Comput., vol. C-33, 
pp. 113-124, Feb. 1984. 

[...] Computing Symposium, pp. 180-185, June, 1988. 

Preparata, F. P., and Shamos, M. I., Computational geometry: an introduction, Springer-Verlag, 
New York, NY, 1985. 

Siewiorek, D. P., and Swarz, R. S., The theory and practice of reliable design, Digital Press, 
Bedford, MA, 1982. 

Tarjan, R. E., “Applications of path compression on balanced trees”, J. ACM, pp. 690-715, 
October, 1979. 

[24] Wallich, P., “Crunching Epsilon,” Scientific American, pp. 22-24, Jan., 1993. 

[...] Checkers,” Proc. [...] ACM Symp. [...]. 


N94-36065 

Experimental Evaluation of the 
Certification-Trail Method 

Gregory F. Sullivan,^1 Dwight S. Wilson,^2 Gerald M. Masson,^3 
Mamoru Itoh,^4 Warren W. Smith, Jonathan S. Kay^5 

Dept. of Computer Science, Johns Hopkins Univ., Baltimore, MD 21218 

Abstract 

Certification trails are a recently introduced and promising 
approach to fault-detection and fault-tolerance [1, 2, 3, 4]. In 
this paper, we report on a comprehensive attempt to assess 
experimentally the performance and overall value of the method. 

The method is applied to algorithms for the following problems: 
Huffman tree, shortest path, minimum spanning tree, sorting, 
and convex hull. Our results reveal many cases in which an 
approach using certification-trails allows for significantly faster 
overall program execution time than a basic time-redundancy 
approach. 

We also examine algorithms for the answer-validation problem 
for abstract data types. This kind of problem was originally 
proposed in [3] and provides a basis for applying the 
certification-trail method to wide classes of algorithms. We implemented and 
analyzed answer-validation solutions for two types of priority 
queues. In both cases, the algorithm which performs 
answer-validation is substantially faster than the original algorithm for 
computing the answers. 

Next we present a probabilistic model and analysis which 
enables comparison between the certification-trail method and the 
time-redundancy approach. The analysis reveals some substantial 
and sometimes surprising advantages for the 
certification-trail method. 

^1 Research partially supported by NSF Grants CCR-8910569 and CCR-8908092 and an 
IBM Technology Interchange Program Grant. 

^2 Research partially supported by NSF Grant CCR-8910569 and an IBM Technology 
Interchange Program Grant. 

^3 Research partially supported by NASA Grant NSG 1442 and an IBM Technology 
Interchange Program Grant. 

^4 Visiting Scholar, Matsushita Electronic Components Co. 

^5 Currently at Dept. of Computer Science, University of California, San Diego 





Finally we discuss the work our group has performed on the 
design and implementation of fault injection testbeds for 
experimental analysis of the certification trail technique. This work 
employs two distinct methodologies: software fault injection 
(modification of instruction, data, and stack segments of programs on 
a Sun SPARCstation ELC and on an IBM 386 PC) and hardware 
fault injection (control, address, and data lines of a Motorola 
MC68000-based target system pulsed at logical zero/one values). 
Our results indicate the viability of the certification trail 
technique. We also believe the tools we have developed provide a 
solid base for additional exploration. 

Keywords: Software fault tolerance, certification trails, error 
monitoring, design diversity, data structures. 


1 Introduction 

Certification trails are a recently introduced and promising approach to 
fault-detection and fault-tolerance [1, 3]. In this paper, we report on a 
comprehensive attempt to assess experimentally the performance and overall 
value of the method. We have implemented several fundamental algorithms 
together with versions of the algorithms which generate and utilize 
certification trails. Specifically, algorithms for the following problems are analyzed: 
Huffman tree, shortest path, minimum spanning tree, sorting, and convex 
hull. Our results reveal many cases in which an approach using certification 
trails allows for significantly faster overall program execution time than a 
basic time redundancy approach. 

We also examine algorithms for the answer-validation problem for 
abstract data types. This kind of problem was originally proposed in [3] and 
provides a basis for applying the certification-trail method to wide classes of 
algorithms. For this paper we implemented and analyzed answer-validation 
solutions for two abstract data types. The first solution is for a simplified 
priority queue which allows insert, min and deletemin operations, and the 
second solution is for a priority queue which allows insert, min, delete and 
deletemin operations. In both cases, the algorithm which performs 
answer-validation is substantially faster than the original algorithm for computing the 
answers. 

This paper next presents a simple probabilistic model and analysis which
enables comparison between the certification-trail method and the time-redundancy
approach. The analysis shows that when the certification-trail
method has a smaller execution time than the time-redundancy approach
it yields strictly superior performance. This means the method has both
a smaller probability of error and a smaller probability of undetected
error. Surprisingly, the analysis also reveals that the
certification-trail method often can display superior performance even when
the method has the same execution time or a longer execution time than the
time-redundancy approach. This superior behavior stems from the typical
asymmetry of the execution times of the first and second executions in the
certification-trail method.

The paper next discusses the work our group has performed on the design 
and implementation of fault injection testbeds. This work employs two 
distinct methodologies: software fault injection and hardware fault injection. 
The software fault injection tool is similar to an interactive debugger but 
more accurately can be considered an interactive bugger. It allows programs 
to be halted and faults to be injected by direct modification of the stack, 
data and instruction segments of a program. Output can then be captured 
and characterized. 

The hardware fault injector is based on injecting faults into an operating 
microprocessor. The injection is performed by explicitly setting one or more 
pins of the microprocessor to logical zero and/or logical one values. The 
timing and duration of the pin setting is under control of a supervisory 
processor. The testbed also includes a multi-processor system. This system 
consists of three processors which are connected to one another pairwise by 
shared banks of dual ported memory. We plan to use this system to conduct 
evaluation of systems which utilize concurrent execution of algorithms using 
the certification-trail method. 

2 Introduction to Certification Trails 

To explain the essence of the certification-trail technique for software fault 
tolerance, we will first discuss a simpler fault-tolerant software method. In 
this method the specification of a problem is given and an algorithm to solve 
it is constructed. This algorithm is executed on an input and the output is 
stored. Next, the same algorithm is executed again on the same input and 
the output is compared to the earlier output. If the outputs differ then an 
error is indicated, otherwise the output is accepted as correct. This software 
fault tolerance method requires additional time, so-called time redundancy
[32, 52]; however, it requires no additional software. It is particularly valuable
for detecting errors caused by transient fault phenomena. If such faults
cause an error during only one of the executions then either the error will be 
detected or the output will be correct. The second possibility, of undetected 
faults, occurs when the output of the execution is unaffected by the faults. 
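
The time-redundancy method just described can be sketched in a few lines. This is only an illustration under stated assumptions: the names `run_with_time_redundancy` and `ERROR` are hypothetical and do not come from the programs used in our experiments.

```python
ERROR = "error"  # hypothetical sentinel meaning "an error was indicated"


def run_with_time_redundancy(algorithm, data):
    """Execute the same algorithm twice on the same input and compare."""
    first = algorithm(data)    # first execution; the output is stored
    second = algorithm(data)   # second execution on the same input
    # Accept the output only if both executions agree.
    return first if first == second else ERROR
```

If a transient fault corrupts only one of the two executions, the comparison either flags the mismatch or the fault did not change the output.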

A variation of the above method uses two separate algorithms, one for 
each execution, which have been written independently based on the problem 
specification. This technique, called N-version programming [16, 12] (in 
this case N=2), allows for the detection of errors caused by some faults 
in the software in addition to those caused by transient hardware faults and
utilizes both time and software redundancy. Errors caused by software faults 
are detected whenever the independently written programs do not generate 
coincident errors. 

The certification-trail technique is designed to obtain similar types of 
error-detection capabilities but expend fewer resources. The central idea, 
as illustrated in Figure 1, is to modify the first algorithm so that it leaves 
behind a trail of data which we call a certification trail. This data is chosen 
so that it can allow the second algorithm to execute more quickly and/or
have a simpler structure than the first algorithm. As above, the outputs of 
the two executions are compared and are considered correct only if they 
agree. Note, however, we must be careful in defining this method or else 
its error detection capability might be reduced by the introduction of data 
dependency between the two algorithm executions. For example, suppose 
the first algorithm execution contains an error which causes an incorrect 
output and an incorrect trail of data to be generated. Further suppose 
that no error occurs during the execution of the second algorithm. It still 
appears possible that the execution of the second algorithm might use the 
incorrect trail to generate an incorrect output which matches the incorrect 
output given by the execution of the first algorithm. Intuitively, the second 
execution would be “fooled” by the data left behind by the first execution. 
The definitions we give below exclude this possibility. They demand that 
the second execution either generate a correct answer or signal that an error 
has been detected in the data trail. 

3 Formal Definition of a Certification Trail 

In this section we will give a formal definition of a certification trail and 
discuss some aspects of its realizations and uses. 





Figure 1: Certification trail method. 

Definition 3.1 A problem $P$ is formalized as a relation, i.e., a set of ordered
pairs. Let $D$ be the domain (that is, the set of inputs) of the relation $P$ and
let $S$ be the range (that is, the set of solutions) for the problem. We say an
algorithm $A$ solves a problem $P$ iff for all $d \in D$ when $d$ is input to $A$ then
an $s \in S$ is output such that $(d, s) \in P$.

Definition 3.2 Let $P : D \rightarrow S$ be a problem. A solution to this problem
using a certification trail consists of two functions $F_1$ and $F_2$ with the following
domains and ranges: $F_1 : D \rightarrow S \times T$ and $F_2 : D \times T \rightarrow S \cup \{\text{error}\}$.
$T$ is the set of certification trails. The functions must satisfy the following
two properties:

(1) for all $d \in D$ there exists $s \in S$ and there exists $t \in T$ such that
    $F_1(d) = (s, t)$ and $F_2(d, t) = s$ and $(d, s) \in P$

(2) for all $d \in D$ and for all $t \in T$
    either ($F_2(d, t) = s$ and $(d, s) \in P$) or $F_2(d, t) = \text{error}$.


We also require that $F_1$ and $F_2$ be implemented so that they map elements
which are not in their respective domains to the error symbol. The
definitions above assure that the error-detection capability of the certification-trail
approach is similar to that obtained with the simple time-redundancy
approach discussed earlier. (That is, if transient hardware faults occur during
only one of the executions then either an error will be detected or the
output will be correct.) It should be further noted, however, that the examples
to be considered will indicate that this new approach can also save overall
execution time.
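
The two-function structure of Definition 3.2 can be illustrated with a small harness. The sketch below is not taken from our implementations: the harness name `run_with_certification_trail`, the sentinel `ERROR`, and the toy problem (finding the minimum of a non-empty list, with the index of the minimum as the trail) are all assumptions chosen for brevity. The toy second function is no faster than the first; it only shows how the second execution either returns a correct answer or signals an error, whatever trail it is given.

```python
ERROR = "error"  # hypothetical error symbol


def run_with_certification_trail(f1, f2, d):
    """Run F1 to get (solution, trail), run F2 on (input, trail), compare."""
    s1, trail = f1(d)
    s2 = f2(d, trail)
    if s2 == ERROR or s1 != s2:
        return ERROR
    return s2


def min_f1(d):
    """Toy F1: return the minimum of d and, as the trail, its index."""
    t = min(range(len(d)), key=lambda j: d[j])
    return d[t], t


def min_f2(d, t):
    """Toy F2: trust nothing about t; verify it before using it."""
    if not isinstance(t, int) or not 0 <= t < len(d):
        return ERROR          # trail is not a valid index
    s = d[t]
    if any(x < s for x in d):
        return ERROR          # trail points at a non-minimal element
    return s
```

For example, `min_f2([5, 2, 9], 0)` returns the error symbol because the trail claims that 5 is the minimum, so a corrupted trail cannot "fool" the second execution.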







Throughout this section we have assumed that our method is implemented
with software; however, it is clearly possible to implement the method
with assistance from dedicated hardware. The degree of diversity or independence
achieved when using certification trails depends on how they are
used. A fuller discussion of this and of the relationship between certification
trails and other approaches to software fault tolerance is contained in the
expanded version of [1].

4 Generalized Priority Queue 

Before we present our example algorithms which use certification trails we 
must discuss the notion of an abstract data type. An abstract data type has 
a well defined data object or set of data objects, and an abstract data type 
has a carefully defined finite collection of operations that can be performed 
on its data object(s). Each operation takes a finite number of arguments 
(possibly zero), and some but not all operations return answers. 

Some of the algorithms presented in the next section use the priority
queue abstract data type. In addition, later in this paper the answer-validation
problem for two variants of the priority queue is presented.
Therefore, we now describe the priority queue. The data consists of a set
of ordered pairs. The first element in these ordered pairs is referred to as
the item number and the second element is called the key value. Ordered
pairs may be added to and removed from the set; however, at all times the item
numbers of distinct ordered pairs must be distinct. It is possible, though,
for multiple ordered pairs to have the same key value. In this paper the item
numbers are integers between 1 and n, inclusive. Our default convention is
that i is an item number and k is a key value of an ordered pair (i,k) in the set.

A total ordering on the pairs of a set can be defined lexicographically as
follows: (i,k) < (i',k') iff k < k' or (k = k' and i < i'). The abstract data
types we will consider support a subset of the following operations.

member(i) returns a boolean value of true if the set contains an ordered
pair with item number i, otherwise returns false.

insert(i,k) adds the ordered pair (i,k) to the set. We require that no other
pair with item number i be in the set.

delete(i) deletes the unique ordered pair with item number i from the set.
We require that a pair with item number i be in the set initially.

changekey(i,k) is executed only when there is an ordered pair with item
number i in the set. This pair is replaced by (i,k).

deletemin (or deletemax) returns the ordered pair which is smallest (or 
largest) according to the total order defined above and deletes this 
pair. If the set is empty then the token “empty” is returned. 

min (or max) returns the ordered pair which is smallest (or largest) accord- 
ing to the total order defined above. If the set is empty then the token 
“empty” is returned. 

predecessor(i) returns the item number of the ordered pair which immedi- 
ately precedes the pair with item number i in the total order. If there 
is no predecessor then the token “smallest” is returned. We require 
that a pair with item number i be in the set initially. 

If an operation violates one of the requirements described above then it is 
considered to be ill-formed. Also, if an operation has the wrong number or 
type of arguments it is considered to be ill-formed.

Many different types and combinations of data structures can be used 
to support different subsets of these operations efficiently. 
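
As an illustration of the operations above, the following sketch supports all of them with a dictionary and a sorted list. It is deliberately simple rather than efficient (insertion and deletion take linear time), and it is not one of the data structures timed in this paper; the class name `PriorityQueue` and the exception `IllFormed` are assumptions made for the example. Pairs are stored internally as (key, item) so that ordinary tuple comparison matches the lexicographic order defined above.

```python
import bisect


class IllFormed(Exception):
    """Raised when an operation violates one of its stated requirements."""


class PriorityQueue:
    def __init__(self):
        self._key = {}    # item number -> key value
        self._order = []  # sorted list of (key value, item number)

    def member(self, i):
        return i in self._key

    def insert(self, i, k):
        if i in self._key:
            raise IllFormed("item number already present")
        self._key[i] = k
        bisect.insort(self._order, (k, i))

    def delete(self, i):
        if i not in self._key:
            raise IllFormed("item number not present")
        k = self._key.pop(i)
        self._order.pop(bisect.bisect_left(self._order, (k, i)))

    def changekey(self, i, k):
        self.delete(i)        # raises IllFormed if i is absent
        self.insert(i, k)

    def min(self):
        if not self._order:
            return "empty"
        k, i = self._order[0]
        return (i, k)

    def deletemin(self):
        if not self._order:
            return "empty"
        k, i = self._order.pop(0)
        del self._key[i]
        return (i, k)

    def predecessor(self, i):
        if i not in self._key:
            raise IllFormed("item number not present")
        pos = bisect.bisect_left(self._order, (self._key[i], i))
        return "smallest" if pos == 0 else self._order[pos - 1][1]
```

With pairs (3,5), (1,5) and (2,4) inserted, `min()` returns (2,4) and `predecessor(3)` returns 1, because (1,5) precedes (3,5) when the keys are equal.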


5 Examples of the Certification Trail Technique 
with Timing Data 

In this section we evaluate the use of certification trails for five well-known 
and significant problems in computer science: the convex hull problem, the 
minimum spanning tree problem, the shortest path problem, the Huffman 
tree problem, and the sorting problem. We have implemented algorithms 
for these problems together with other algorithms which generate and use 
certification trails. 

We provide a full description of the algorithm for the convex hull problem 
which generates a certification trail and a full description of the algorithm 
which uses that trail. This material has not appeared in our previous publi- 
cations [1, 3]. Because of space considerations the discussion of three of the 
other algorithms is abbreviated, but references to previous publications or 
technical reports which describe the algorithms more fully are given. The 
treatment of the sort algorithm is brief but is detailed enough for the interested
reader to implement the certification-trail method.




The algorithms we have chosen to implement are not always the algorithms
which have the smallest asymptotic time complexity. Often the
asymptotically fastest algorithms have large constants of proportionality
which make them slower on the data sizes we examined. We modified and
used some programs from major software distributions such as quicker-sort
from a Berkeley Unix distribution. Other algorithms were based on textbook
discussions. It should be stressed here that this research is exploratory
and we hope to further increase our corpus of algorithm and data-structure
implementations.

5.1 Systems used for timing data 

We have collected timing data for the algorithms considered using a Sun 
workstation, an IBM 386 PC and a Motorola 68000-based system. 

The Sun machine utilized was a SPARCstation ELC with 16MB of
RAM. The system was run as a standalone machine in single user mode
during the timing experiments. Timing data was obtained through the
getrusage() system call; the user times are reported in the data.

Some of the algorithms were also run on an MSDOS machine: a Northgate
386/33 with 8MB of RAM. The programs were compiled using DJGPP,
DJ Delorie’s port of the GNU GCC compiler to MSDOS. This compiler uses
a DOS extender to allow programs to run in protected mode; thus nearly all
of the 8MB in the machine was available, thereby allowing data sets comparable
in size to those used on the Sun. The programs required no change
to run under MSDOS, though the data generators required minor modification
because the drand48() family of random number generators was not
available.

Finally, some of the algorithms were also run on a Motorola MC68000-based
target system. In addition to the MC68000 microprocessor which served as
the CPU, the system comprised 512K bytes of RAM, 512
bytes of ROM, and numerous I/O modules to support serial and parallel
communication. A timer module is also included in the system which uses
the 4 MHz clock as a reference so as to provide execution time data for
experiments. This system is discussed in Section 10 relative to fault injection
experiments.







5.2 Explanation of timing data table entries

Much of the data presented in the timing tables is essentially self-explanatory
relative to the certification trail technique and algorithms considered. However,
a brief discussion of the table entries is appropriate.

The Basic Algorithm timing data refers to the execution time of the 
algorithm in producing the output without the generation of the certification 
trail. All timing data is listed in seconds. 

The Generate Certif. timing data refers to the execution time of the al- 
gorithm in producing the output with the additional overhead of generating 
the certification trail. 

The Use Certif. timing data refers to the execution time of the algorithm
in producing the output while using the certification trail.

The Compare timing data refers to the time necessary to compare the
outputs from either two Basic Algorithm runs or from a Generate Certification
Trail run and a Use Certification Trail run. (Obviously, the value
of the comparison would be the same in each case.) For some of the
experiments, the comparison time was too small to measure and is therefore listed as
0.00. In other experiments, the comparison was included in the algorithm
execution timing data and therefore is not separately listed.

The Total Basic timing data is twice the Basic Algorithm timing data 
plus the Comparison time (when available) so as to evaluate the classical 
time-redundancy approach. 

The Total Certif. timing data is the sum of the Generate Certif. timing 
data and the Use Certif. data and Comparison data (when available) so as 
to evaluate the certification trail approach. 

The % Savings data is the percentage of execution time saved
by using the certification trail method as compared to the classical
time redundancy method.
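
The derived columns can be recomputed from the measured ones. The sketch below follows the definitions just given; the function name `derived_columns` is hypothetical, and the savings formula (time saved divided by Total Basic) is our reading of the description rather than a formula stated in the text. The sample values are those of the 10000-point convex hull run on the Sun (Basic 1.26, Generate 1.29, Use 0.13, Compare 0.01).

```python
def derived_columns(basic, generate, use, compare=0.0):
    """Return (Total Basic, Total Certif., % Savings) from measured times."""
    total_basic = 2 * basic + compare          # two basic runs plus comparison
    total_certif = generate + use + compare    # generate, use, then comparison
    savings = 100.0 * (total_basic - total_certif) / total_basic
    return total_basic, total_certif, savings


total_basic, total_certif, savings = derived_columns(1.26, 1.29, 0.13, 0.01)
print(f"{total_basic:.2f} {total_certif:.2f} {savings:.2f}")
```

The two totals come out as 2.53 and 1.43 seconds, matching the tabulated entries, and the savings figure is about 43.5 percent, which agrees with the tabulated 43.47 up to the last digit.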

For the Huffman tree data, the input size for the Huffman tree program
is the number of nodes. Each node is given a frequency, chosen uniformly
from the integers {1, 2, ..., n}. Here n was selected to be the number of nodes,
but in fact its value does not affect the running time of the algorithm. In
order for the algorithm to execute correctly, the sum of the frequencies must
not cause an arithmetic overflow. The certification trail method will detect
this.

For the minimum spanning tree and shortest path tables, there are two
numbers associated with the input size: the first is the number of vertices
in the graph, and the second is the number of edges. A graph with the required
number of vertices and edges is selected uniformly from the set of all such graphs, then tested for
connectedness. The algorithms will function regardless of connectedness,
but allowing graphs that are not connected would introduce undesirable
variation in the timing data.
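
One way to realize this kind of input generation is sketched below. It is not our data generator: the function names are hypothetical, vertices are assumed to be numbered from 0, and we assume that a graph failing the connectedness test is simply discarded and redrawn.

```python
import random


def is_connected(n, edges):
    """Depth-first search from vertex 0; True if every vertex is reached."""
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    seen = [False] * n
    seen[0] = True
    stack, count = [0], 1
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if not seen[v]:
                seen[v] = True
                count += 1
                stack.append(v)
    return count == n


def random_connected_graph(n, m, rng):
    """Draw simple graphs with n vertices and m edges until one is connected."""
    if n < 2 or m < n - 1 or m > n * (n - 1) // 2:
        raise ValueError("no connected simple graph with these parameters")
    while True:
        edges = set()
        while len(edges) < m:               # distinct edges, uniformly chosen
            u, v = rng.sample(range(n), 2)
            edges.add((min(u, v), max(u, v)))
        if is_connected(n, edges):
            return sorted(edges)
```

A call such as `random_connected_graph(100, 1000, random.Random(1))` would produce an input of the smallest size used in the minimum spanning tree and shortest path tables; edge weights would still have to be attached.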

For the convex hull tables, the input size is the number of points in the 
data set. The points are chosen uniformly from the set of points with integer 
coordinates between 0 and 30,000. 

For the sorting tables, sorting was timed in two ways. The first set of 
results were obtained by sorting integers. To generate a trail, an integer tag 
is added to each input integer and an array of these pairs passed to the sort 
function. After sorting, the “data” integers are placed in an array, and the
“tag” integers are placed on the certification trail. Thus, the sort call looks
the same as a normal sort function. The time to massage the data in this 
manner is included in the cost of the call. This method resulted in only 
a small speedup, because of the overhead involved in massaging the data, 
and because the sort routine must swap pairs of integers instead of single 
integers. The integers were chosen uniformly over the range 0 to 1,000,000. 

The second method was to sort an array of pointers to structures. In this 
case it was assumed that the structure contained a field that would serve
as the tag. The sort program needed only to fill in this field, and not copy 
the structures to a second array. This method results in dramatic speedups. 
Integer keys were used, though a more complex key will work as well (in 
fact, a more complex key is very likely to increase the speedup achieved). 
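
The tagging idea for sorting can be sketched as follows. This is an illustration, not the modified quicker-sort used for the timings: we assume the tag of each element is its position in the original input, so the trail is the permutation produced by the first sort, and the function names and the `ERROR` sentinel are hypothetical.

```python
ERROR = "error"  # hypothetical error symbol


def sort_generating_trail(data):
    """First execution: sort, and emit the tags (original positions) as trail."""
    order = sorted(range(len(data)), key=lambda i: data[i])
    return [data[i] for i in order], order


def sort_using_trail(data, trail):
    """Second execution: linear-time placement plus linear-time checks."""
    n = len(data)
    if len(trail) != n:
        return ERROR
    seen = [False] * n
    out = []
    for i in trail:
        # Every tag must be a valid, previously unused position.
        if not isinstance(i, int) or not 0 <= i < n or seen[i]:
            return ERROR
        seen[i] = True
        out.append(data[i])
    # The rearranged data must be in nondecreasing order.
    for a, b in zip(out, out[1:]):
        if a > b:
            return ERROR
    return out
```

The second execution performs no comparisons between arbitrary pairs of elements; it only verifies that the trail is a permutation and that the permuted data is ordered, which is why it runs much faster than the sort itself.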

For the priority queue and generalized priority queue tables, the input 
size n is the number of commands executed. The item numbers range from 
1 to n (i.e., there are as many item numbers as there are commands). The
commands are not chosen with equal probability, but rather the first n/2 
are weighted toward insert operations while the second half are weighted 
toward the other operations, the weightings remaining the same for all runs. 
This weighting is necessary in order to force a large queue. 

The timing data displayed in the tables should be considered not only
with respect to the overall efficiency of the certification trail method compared
to classical time redundancy but also with respect to the probabilistic analysis
given in Section 9, in which we show that when the certification-trail method
has a smaller execution time than the time-redundancy approach it yields
strictly superior performance. This means the certification trail method has
both a smaller probability of error and a smaller probability of undetected
error.





5.3 Convex Hull Example 

The convex hull problem is a fundamental one in computational geometry.
Our certification trail solution is based on a solution due to Graham [24]
which is called Graham’s Scan. For basic definitions in computational geometry
see the text of Preparata and Shamos [46]. For simplicity in the
discussion which follows we will assume the points are in so-called general
position, e.g., no three points are collinear. It is not hard to remove this
restriction.

Definition 5.1 The convex hull of a set of points, $S$, in the Euclidean
plane is defined as the smallest convex polygon enclosing all the points.
This polygon is unique and its vertices are a subset of the points in $S$. It is
specified by a counterclockwise sequence of its vertices.

Figure 2(c) shows a convex hull for the points indicated by black dots.
The algorithm given below constructs the convex hull incrementally in a
counterclockwise fashion. Sometimes it is necessary for the algorithm to
“back up” the construction by throwing some vertices out and then continuing.
The first step of the algorithm selects an “extreme” point and calls
it $p_1$. The next two steps sort the remaining points in a way which is depicted
in Figure 2(a). It is not hard to show that after these three steps the
points when taken in order, $p_1, p_2, \ldots, p_n$, form a simple polygon, although
this polygon may not be convex. It is possible to think of the algorithm
as removing points from this simple polygon until it becomes convex. The
main FOR loop iteration adds vertices to the polygon under construction
and the inner WHILE loop removes vertices from the construction. A point
is removed when the angle test performed at line 6 reveals that it is not on
the convex hull because it falls within the triangle defined by three other
points. A “snapshot” of the algorithm given in Figure 2(b) shows that $q_5$
is removed from the hull. The angle formed by $q_4, q_5, p_6$ is less than 180
degrees. This means $q_5$ lies within the triangle formed by $q_4, p_1, p_6$. (Note,
$q_1 = p_1$.) In general, when the angle test is performed if the angle formed by
$q_{m-1}, q_m, p_k$ is less than 180 degrees then $q_m$ lies within the triangle formed
by $q_{m-1}, p_1, p_k$. Below it will be revealed that this is the main fact that
our certification trail relies on. When the main FOR loop is complete the
convex hull has been constructed.

Figure 2: Convex hull example.

Algorithm CONVEXHULL($S$)

Input: Set of points, $S$, in $R^2$
Output: Counterclockwise sequence of points in $R^2$ which define convex hull of $S$

1  Let $p_1$ be the point with the largest x coordinate (and smallest y to break ties)
2  For each point $p$ (except $p_1$) calculate the slope of the line through $p_1$ and $p$
3  Sort the points (except $p_1$) from smallest slope to largest. Call them $p_2, \ldots, p_n$
4  $q_1 := p_1$; $q_2 := p_2$; $q_3 := p_3$; $m := 3$
5  FOR $k = 4$ to $n$ DO
6      WHILE the angle formed by $q_{m-1}, q_m, p_k$ is > 180 degrees DO $m := m - 1$ END
7      $m := m + 1$
8      $q_m := p_k$
9  END FOR
10 FOR $i = 1$ to $m$ DO OUTPUT($q_i$) END FOR
END CONVEXHULL

First execution: In this execution the code CONVEXHULL is used.
The certification trail is generated by adding an output statement within the
WHILE loop. Specifically, if an angle of less than 180 degrees is found in the
WHILE loop test then the four tuple consisting of $q_m, q_{m-1}, p_1, p_k$ is output
to the certification trail. The table below shows the four tuples of points
that would be output by the algorithm when run on the example in Figure
2. The points in the table are given the same names as in Figure 2(a). The
final convex hull points $q_1, \ldots, q_m$ are also output to the certification trail.
Strictly speaking the trail output does not consist of the actual points in $R^2$.
Instead, it consists of indices to the original input data. This means if the
original data consists of $s_1, s_2, \ldots, s_n$ then rather than output the element in
$R^2$ corresponding to $s_i$ the number $i$ is output. It is not hard to code the
program so that this is done.





Point not on convex hull    Three surrounding points
$p_5$                       $p_4, p_1, p_6$
$p_4$                       $p_3, p_1, p_6$
$p_7$                       $p_6, p_1, p_8$
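
A compact sketch of the first execution is given here. It is an illustration rather than our timed implementation, and it makes several assumptions: coordinates are integers (as in our data sets), the points are in general position, there are at least three of them, indices start at 0, and the angle test of line 6 is expressed as a cross-product sign test (a vertex is removed when the turn it makes is not counterclockwise). The function names are hypothetical.

```python
from fractions import Fraction


def orient(a, b, c):
    """Twice the signed area of triangle abc; > 0 for a counterclockwise turn."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])


def convex_hull_with_trail(pts):
    """Return (hull, trail) as lists of indices into pts."""
    n = len(pts)
    # Step 1: largest x coordinate, smallest y to break ties.
    first = max(range(n), key=lambda i: (pts[i][0], -pts[i][1]))
    x1, y1 = pts[first]

    # Steps 2-3: sort the other points by slope of the line through p_1.
    # A point directly above p_1 has a vertical line and is placed first.
    def slope_key(i):
        dx, dy = pts[i][0] - x1, pts[i][1] - y1
        return (0, 0) if dx == 0 else (1, Fraction(dy, dx))

    order = [first] + sorted((i for i in range(n) if i != first), key=slope_key)

    # Steps 4-9: scan, recording a four tuple whenever a vertex is removed.
    hull = order[:3]
    trail = []
    for k in order[3:]:
        while len(hull) >= 2 and orient(pts[hull[-2]], pts[hull[-1]], pts[k]) <= 0:
            removed = hull.pop()
            # (q_m, q_{m-1}, p_1, p_k): removed point and its enclosing triangle
            trail.append((removed, hull[-1], first, k))
        hull.append(k)
    return hull, trail
```

For the five points (0,0), (4,0), (4,4), (0,4), (2,1) the hull is returned as the index sequence [1, 2, 3, 0] and the trail holds the single tuple (4, 3, 1, 0), recording that the point (2,1) lies inside the triangle with corners (0,4), (4,0) and (0,0).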




Second execution: Let the certification trail consist of a set of four
tuples, $(x_1, a_1, b_1, c_1), (x_2, a_2, b_2, c_2), \ldots, (x_r, a_r, b_r, c_r)$, followed by the
supposed convex hull, $q_1, q_2, \ldots, q_m$. The code for CONVEXHULL is not used
in this execution. Indeed, the algorithm performed is dramatically different
from CONVEXHULL.

It consists of five checks on the trail data.

• First, the algorithm checks for $i \in \{1, \ldots, r\}$ that $x_i$ lies within the
triangle defined by $a_i$, $b_i$, and $c_i$.

• Second, the algorithm checks that for each triple of counterclockwise
consecutive points on the supposed convex hull the angle formed by
the points is less than or equal to 180 degrees.

• Third, it checks that there is a one to one correspondence between the
input points and the points in $\{x_1, \ldots, x_r\} \cup \{q_1, \ldots, q_m\}$.

• Fourth, it checks that for $i \in \{1, \ldots, r\}$, $a_i$, $b_i$, and $c_i$ are among the
input points.

• Fifth, it checks that there is a unique point among the points on the
supposed convex hull which is a local extreme point. We say a point
$q$ on the hull is a local extreme point if its predecessor in the counterclockwise
ordering has a strictly smaller y coordinate and its successor
in the ordering has a smaller or equal y coordinate.

If any of these checks fail then execution halts and “error” is output. As
mentioned above, the trail data actually consists of indices into the input
data. This does not unduly complicate the checks above; instead it makes
them easier. The correctness and adequacy of these checks must be proven.
Because of space limitations we shall not give the proof here.
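
The five checks can be sketched directly in code. The sketch below is illustrative only and carries no proof of adequacy: the function name `check_hull_trail` and the `ERROR` sentinel are hypothetical, indices start at 0, "lies within" is taken to mean strictly inside, and angle comparisons are again expressed as cross-product sign tests. The fourth check is performed first so that later indexing is safe.

```python
ERROR = "error"  # hypothetical error symbol


def orient(a, b, c):
    """Twice the signed area of triangle abc; > 0 for a counterclockwise turn."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])


def check_hull_trail(pts, tuples, hull):
    """Return the hull (list of indices) if all five checks pass, else ERROR."""
    n, m = len(pts), len(hull)
    indices = list(hull)
    for t in tuples:
        if len(t) != 4:
            return ERROR
        indices.extend(t)
    # Fourth check: every index in the trail refers to an input point.
    if m < 3 or any(not isinstance(i, int) or not 0 <= i < n for i in indices):
        return ERROR
    # First check: each x_i lies strictly inside triangle a_i, b_i, c_i.
    for x, a, b, c in tuples:
        s1 = orient(pts[a], pts[b], pts[x])
        s2 = orient(pts[b], pts[c], pts[x])
        s3 = orient(pts[c], pts[a], pts[x])
        if not ((s1 > 0 and s2 > 0 and s3 > 0) or (s1 < 0 and s2 < 0 and s3 < 0)):
            return ERROR
    # Second check: no consecutive triple on the hull makes a clockwise turn.
    for j in range(m):
        if orient(pts[hull[j]], pts[hull[(j + 1) % m]], pts[hull[(j + 2) % m]]) < 0:
            return ERROR
    # Third check: removed points and hull points together cover the input
    # exactly once each.
    seen = [False] * n
    for i in [t[0] for t in tuples] + list(hull):
        if seen[i]:
            return ERROR
        seen[i] = True
    if not all(seen):
        return ERROR
    # Fifth check: exactly one local extreme point on the hull.
    extremes = 0
    for j in range(m):
        y_prev = pts[hull[j - 1]][1]
        y_here = pts[hull[j]][1]
        y_next = pts[hull[(j + 1) % m]][1]
        if y_prev < y_here and y_next <= y_here:
            extremes += 1
    if extremes != 1:
        return ERROR
    return list(hull)
```

Each loop touches every tuple or hull vertex a constant number of times, so the whole routine runs in time linear in the number of input points.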

Time complexity: In the first execution the sorting of the input points
takes $O(n \log(n))$ time where $n$ is the number of input points. One can show
that this cost dominates and the overall complexity is $O(n \log(n))$.




Size          Basic        Generate     Use          Compare      Total        Total        % Saving
              Algorithm    Certif.      Certif.                   Basic        Certif.
10000         0.74         0.79         0.11         0.03         1.51         0.93         38.41
20000         1.65         [illegible]  0.23         0.06         3.36         2.05         39.28
50000         4.64         4.79         0.59         0.14         9.42         5.52         41.40
100000        9.95         10.32        1.19         0.28         20.18        11.79        41.57

Table 1: Huffman Tree on Sun


Size          Basic        Generate     Use          Compare      Total        Total        % Saving
              Algorithm    Certif.      Certif.                   Basic        Certif.
10000         1.09         1.32         0.32         0.10         2.28         1.74         23.68
20000         2.38         2.91         0.63         0.21         4.97         3.75         24.55
50000         7.01         8.80         1.59         0.50         14.52        10.89        25.00

Table 2: Huffman tree on 386/33


It is possible to implement the second execution so that all five checks are
done in $O(n)$ time. Checking
that a point lies within a triangle is a geometric calculation that can be done
in constant time. Comparing the angle formed by three points to 180 degrees
can be done in constant time. The third and fourth checks can be
done in $O(n)$ time because the certification trail contains indices into the input
data as described above. The uniqueness of the “local extreme” can also be
checked in linear time.

5.4 Minimum Spanning Tree Example 

This classic problem has been examined extensively in the literature and 
an historical survey is given in [25]. Our approach is applied to a variant 


Size          Basic        Generate     Use          Compare      Total        Total        % Saving
              Algorithm    Certif.      Certif.                   Basic        Certif.
10000         1.26         1.29         0.13         0.01         2.53         1.43         43.47
20000         2.71         2.81         0.31         0.01         5.43         3.13         42.35
50000         7.41         7.48         0.70         0.01         14.83        8.19         44.77
100000        15.76        15.87        1.43         0.01         31.53        17.31        45.09

Table 3: Convex Hull on Sun











Size          Basic        Generate     Use          Compare      Total        Total        % Saving
              Algorithm    Certif.      Certif.                   Basic        Certif.
10000         1.79         1.88         0.15         0.01         3.59         2.04         43.18
20000         3.86         4.08         0.31         0.01         7.73         4.40         43.08
50000         10.51        11.16        0.78         0.01         21.03        11.95        43.18
100000        22.40        23.97        1.64         0.01         44.81        25.62        42.83

Table 4: Convex Hull on 386/33


Size          Basic        Generate     Use          Compare      Total        Total        % Saving
              Algorithm    Certif.      Certif.                   Basic        Certif.
100,1000      0.04         0.05         0.01         0.00         0.08         0.06         25.00
200,2000      0.10         0.12         0.02         0.00         0.20         0.14         30.00
500,5000      0.30         0.31         0.06         0.00         0.60         0.37         38.33
1000,10000    0.68         0.72         0.13         0.00         1.36         0.85         37.50
1500,15000    1.10         1.14         0.19         0.00         2.20         1.33         39.55
2000,20000    1.51         1.58         0.27         0.00         3.02         1.85         38.74
2500,25000    1.97         2.00         0.35         0.00         3.94         2.35         40.36

Table 5: Minimum Spanning Tree on Sun


Size          Basic        Generate     Use          Compare      Total        Total        % Saving
              Algorithm    Certif.      Certif.                   Basic        Certif.
100,1000      0.04         0.03         0.01         0.00         0.08         0.04         50.00
200,2000      0.08         0.08         0.02         0.00         0.16         0.10         37.50
500,5000      0.26         0.24         0.06         0.00         0.52         0.30         42.31
1000,10000    0.59         0.56         0.13         0.00         1.18         0.69         41.53
1500,15000    0.93         0.90         0.20         0.00         1.86         1.10         40.86
2000,20000    1.29         1.28         0.28         0.00         2.58         1.56         39.53
2500,25000    1.67         1.65         0.36         0.00         3.34         2.01         39.82

Table 6: Shortest Path on Sun


Size          Basic        Generate     Use          Compare      Total        Total        % Saving
              Algorithm    Certif.      Certif.                   Basic        Certif.
10000         0.23         0.40         0.06         0.01         0.47         0.47         0.00
20000         0.51         0.86         0.13         0.01         1.02         1.00         1.96
50000         1.38         2.35         0.35         0.02         2.78         2.72         2.15
100000        2.96         4.97         0.76         0.04         5.92         5.73         3.20

Table 7: Integer sorting on Sun







Size          Basic        Generate     Use          Compare      Total        Total        % Saving
              Algorithm    Certif.      Certif.                   Basic        Certif.
10000         1.02         1.18         [illegible]  [illegible]  2.08         1.36         34.62
20000         2.16         2.49         [illegible]  0.08         4.40         2.86         35.00
50000         5.67         6.48         [illegible]  0.22         11.56        7.43         35.73
100000        11.74        13.48        [illegible]  0.44         23.92        15.49        35.24

Table 8: Integer Sort on 386/33


Size          Basic        Generate     Use          Compare      Total        Total        % Saving
              Algorithm    Certif.      Certif.                   Basic        Certif.
10000         0.32         0.33         0.03         0.01         0.65         0.37         43.07
20000         0.71         0.72         0.07         0.01         1.43         0.80         44.05
50000         1.97         1.99         0.18         0.02         3.96         2.19         44.69
100000        4.32         4.37         0.38         0.05         8.69         4.80         44.76

Table 9: Pointer sorting on Sun


Size          Basic        Generate     Use          Compare      Total        Total        % Saving
              Algorithm    Certif.      Certif.                   Basic        Certif.
10000         [illegible]  1.15         [illegible]  0.03         2.19         [illegible]  42.92
20000         2.41         2.41         [illegible]  0.07         4.89         [illegible]  46.01
50000         6.37         6.38         [illegible]  [illegible]  12.96        [illegible]  45.83
100000        [illegible]  13.33        0.89         0.43         [illegible]  [illegible]  45.76

Table 10: Pointer Sort on 386/33


Size          Basic        Generate     Use          Compare      Total        Total        % Saving
              Algorithm    Certif.      Certif.                   Basic        Certif.
10000         0.86         0.83         0.14         0.01         1.73         0.98         43.35
20000         1.92         1.87         0.28         0.01         3.85         2.16         43.89
50000         5.32         5.37         0.69         0.02         10.64        6.08         42.85

Table 11: Data structs on Sun




































Size          Basic        Generate     Use          Total        Total        % Saving
              Algorithm    Certif.      Certif.      Basic        Certif.
8             0.075        0.091        0.026        [illegible]  0.117        28.7
16            [illegible]  0.248        0.054        [illegible]  0.302        42.4
32            [illegible]  0.629        0.111        1.122        [illegible]  51.6
64            1.330        1.468        0.224        2.660        1.692        57.2
128           3.120        3.398        0.450        6.240        3.848        62.2
256           7.225        7.783        0.903        14.450       8.686        66.4
512           16.270       17.388       1.808        32.540       19.196       69.5

Table 12: Huffman Tree on 68000-based system


Size            Basic        Generate     Use          Total        Total        % Saving
Nodes   Edges   Algorithm    Certif.      Certif.      Basic        Certif.
10      15      0.053        0.054        0.055        0.106        0.109        -2.5
10      20      0.071        0.072        0.073        0.142        0.145        -1.7
10      25      0.088        0.089        0.090        0.176        0.179        -1.5
50      75      0.320        0.323        0.309        0.639        0.632        1.2
50      100     0.423        0.427        0.400        0.846        0.826        2.3
50      125     0.492        0.496        0.464        0.984        0.960        2.5
100     150     0.652        0.658        0.602        1.305        1.260        3.6
100     200     0.874        0.881        0.789        1.748        1.671        4.6
100     250     1.036        1.045        0.938        2.073        1.983        4.5
500     750     3.588        3.617        3.047        7.176        6.664        7.7
500     1000    4.780        4.817        3.955        9.560        8.772        9.0
500     1250    5.656        5.698        4.717        11.311       10.415       8.6
1000    1500    7.474        7.533        6.115        14.949       13.649       9.5
1000    2000    9.902        9.977        7.919        19.803       17.895       10.7
1000    2500    11.830       11.917       9.517        23.660       21.434       10.4
1500    2250    11.415       11.503       9.157        22.830       20.660       10.5
1500    3000    14.967       15.077       11.802       29.933       26.879       11.4

Table 13: Min Spanning Tree on 68000-based system


of the Prim/Dijkstra algorithm [47, 18] as explicated in [54]. We provide a 
definition of the problem below. For more information on the graph theoretic 
terminology used in this problem and others the reader may consult [54, 17].

Definition 5.2 Let $G = (V, E)$ be a graph and let $w$ be a positive rational
valued function defined on $E$. A subtree of $G$ is a tree, $T(V', E')$, with
$V' \subseteq V$ and $E' \subseteq E$. We say $T$ spans $V'$ and $V'$ is
spanned by $T$. If $V' = V$ then we say $T$ is a spanning tree of $G$. The
weight of this tree is $\sum_{e \in E'} w(e)$.
A minimum spanning tree is a spanning tree of minimum weight.

The problem is to input a graph with edge weights and output a mini- 
mum spanning tree. The algorithm for this problem which has the fastest 
asymptotic time complexity uses fusion trees and is given in [20]. This al- 
gorithm however appears to have a large constant of proportionality. Other 
asymptotically fast algorithms [22] also appear to be handicapped by large 
constants of proportionality. A fuller discussion of the two algorithms we 
employ for generation and use of a certification trail is given in [1].

5.5 Shortest Path Example 

This is another classic problem which has been examined extensively in the 
literature. Our approach is applied to a variant of the Dijkstra algorithm 
[18] as explicated in [54]. We are concerned with the single source problem, 
i.e., given a graph and a vertex s, find the shortest path from s to v for
every vertex v.

The algorithm for this problem which has the fastest asymptotic time 
complexity uses fusion trees and is given in the same paper which we cited 
earlier when considering the minimum spanning tree problem [20]. This
algorithm, however, appears to have a large constant of proportionality. Our
solution employing the certification trail method is very closely based on the 
solution we gave for the minimum spanning tree problem [1]. 

5.6 Huffman Tree Example 

This is another old algorithmic problem, and one of the original solutions
was found by Huffman [30]. It has been used extensively to perform data
compression through the design and use of so-called Huffman codes. These
codes are prefix codes which are based on the Huffman tree and which
yield excellent data compression ratios. The tree structure and the code
design are based on the frequencies of individual characters in the data to
be compressed. Here we are concerned exclusively with the Huffman tree.
See [30] for information about the coding application.

Definition 5.3 The Huffman tree problem is the following: Given a sequence
of frequencies (positive integers) $f[1], f[2], \ldots, f[n]$, construct a tree
with $n$ leaves and with one frequency value assigned to each leaf so that
the weighted path length is minimized. Specifically, the tree should minimize
the following sum: $\sum_{l_i \in LEAF} len(i) \, f[i]$, where $LEAF$ is the
set of leaves, $len(i)$ is the length of the path from the root of the tree to
the leaf $l_i$, and $f[i]$ is the frequency assigned to the leaf $l_i$.

The method we employ to generate and use a certification trail is detailed 
in the following technical report [2]. 
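
To make the quantity in Definition 5.3 concrete, the following minimal sketch computes the minimum weighted path length with the standard greedy Huffman construction. It is not the certification-trail method of [2]; the function name and the use of Python's heapq module are assumptions made only for illustration.

```python
import heapq


def min_weighted_path_length(freqs):
    """Return the minimum of sum(len(i) * f[i]) over all trees with these leaves.

    Standard greedy construction: repeatedly merge the two smallest weights.
    Each merge adds one level above every leaf in the merged subtrees, so the
    weighted path length equals the sum of all merged weights.
    """
    heap = list(freqs)
    heapq.heapify(heap)
    total = 0
    while len(heap) > 1:
        merged = heapq.heappop(heap) + heapq.heappop(heap)
        total += merged
        heapq.heappush(heap, merged)
    return total


if __name__ == "__main__":
    print(min_weighted_path_length([5, 9, 12, 13, 16, 45]))
```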

5.7 Sorting Example 

This important problem has a massive literature. In this section we will 
discuss how to apply the certification trail approach to the sorting problem. 
Let us assume that the sorting algorithm takes as input an array of n ele- 
ments and outputs an array of n elements. The algorithm is supposed to 
place the data into non-decreasing order. 

To design a certification trail algorithm we must discover the nature of 
the data that should be included in the certification trail to allow quick 
computation of the final output sorted array. Suppose that we decide to 
use the output array itself as the certification trail. We note that it is easy 
to check that this array is in non-decreasing order by simply performing a 
single pass over the array. Unfortunately, it is considerably more difficult to
make sure that this array contains exactly the same elements as the original
input array. Indeed, this problem has a lower bound time complexity of
$\Omega(n \log n)$ in a comparison based model.

Because of this difficulty we use the permutation of the elements defined 
by the input and output data arrays as the certification trail. To compute 
this permutation we allocate a new array of size n called permute which 
is initialized by setting its ith element to i. (Alternatively, we add a new 
field to pre-existing structures when structures are being sorted.) Each time 
the sort algorithm exchanges two elements the corresponding elements in 
the permute array are also exchanged. (If structures are being used then 
this happens automatically.) This approach works with all sort algorithms 
which are based on exchanging array elements. The code below shows how
the permute array is used to rapidly recompute the final sorted output array
and how the permute array itself is checked.

Algorithm SORT USING TRAIL
Input: Arrays indata[1..n] and permute[1..n]
Output: outdata[1..n] containing the data in indata sorted into non-decreasing order

The first part of the algorithm checks that the permute values are in the
proper range and constructs the output array.

    FOR i := 1 to n DO
        IF permute[i] > n or permute[i] < 1
        THEN OUTPUT(“Error: not a permutation”) STOP
        ELSE outdata[i] := indata[permute[i]]
    END FOR

The next part of the algorithm checks that the output array is properly
ordered.

    FOR i := 2 to n DO
        IF outdata[i-1] > outdata[i] THEN OUTPUT(“Error: decreasing value”) STOP
    END FOR

The final part of the algorithm checks that the permute array defines a
proper permutation, i.e., each element is mapped to exactly one element.

    FOR i := 1 to n DO present[i] := FALSE END FOR
    FOR i := 1 to n DO
        IF present[permute[i]] = TRUE
        THEN OUTPUT(“Error: not a permutation”) STOP
        ELSE present[permute[i]] := TRUE
    END FOR

END SORT USING TRAIL
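
The three checks of SORT USING TRAIL can be sketched in Python as follows. This is an illustration under stated assumptions, not the code used in the experiments: indices are 0-based, the function names are invented here, and the first phase obtains the permutation from Python's built-in sorted rather than by mirroring the exchanges made inside a quicksort.

```python
def sort_with_trail(indata):
    """First phase (stand-in): sort and return the permutation as the trail."""
    # permute[j] is the input position of the element placed at output slot j.
    permute = sorted(range(len(indata)), key=lambda i: indata[i])
    outdata = [indata[i] for i in permute]
    return outdata, permute


def sort_using_trail(indata, permute):
    """Second phase: rebuild the sorted output from the trail, or raise."""
    n = len(indata)
    if len(permute) != n:
        raise ValueError("Error: not a permutation")

    # Part 1: range check on every trail entry, then build the output array.
    outdata = []
    for p in permute:
        if not isinstance(p, int) or p < 0 or p >= n:
            raise ValueError("Error: not a permutation")
        outdata.append(indata[p])

    # Part 2: one pass to confirm non-decreasing order.
    for i in range(1, n):
        if outdata[i - 1] > outdata[i]:
            raise ValueError("Error: decreasing value")

    # Part 3: every input position must be used exactly once.
    present = [False] * n
    for p in permute:
        if present[p]:
            raise ValueError("Error: not a permutation")
        present[p] = True
    return outdata


if __name__ == "__main__":
    data = [3, 1, 2]
    first_result, trail = sort_with_trail(data)
    second_result = sort_using_trail(data, trail)
    print(first_result == second_result)
```

Each part is a single linear pass, so the second phase takes time proportional to n whatever sorting method produced the trail.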

Our experimental work on the Sun was based on a variant of quicksort 
[26] which is called quickersort [50]. The implementation of this algorithm 
that we used was provided by a Berkeley UNIX software distribution for 
the Sun. Our experimental work on the IBM PC was based on a quicksort 
algorithm implemented as part of a Gnu library of functions. 





6 Answer-Validation Problem for Abstract Data Types

The next few sections of this paper are concerned with the answer-validation
problem for abstract data types. This kind of problem was originally
proposed in [3] and provides a basis for applying the certification-trail method
to wide classes of algorithms. Because of space limitations we will not discuss
the details of how this can be done.

Below, we define the answer-validation problem. Next, we give two ex- 
ample algorithms for the answer-validation problems. The first algorithm 
is for a priority queue which allows insert, min and deletemin operations. 
The second algorithm is for a priority queue which allows insert, min, delete 
and deletemin operations. In the next section experimental data on the 
execution times of these algorithms is presented. 

For each abstract data type we define an answer-validation problem. In- 
tuitively, the answer validation problem consists of checking the correctness 
of a sequence of supposed answers to a sequence of operations performed on 
the abstract data type. More formally, the input to the answer-validation 
problem is a sequence of operations on the abstract data type together with 
the arguments of each operation. In addition, the sequence contains the 
supposed answers for each of the operations which return answers. In par- 
ticular, each supposed answer is paired with the operation that is supposed 
to return it. Examples of such inputs are given in the columns labelled
“Operation” and “Answer” of table 15.

The output for the answer-validation problem is the word “correct” if 
the answers given in the input match the answers that would be generated 
by actually performing the operations. The output is the word “incorrect” 
if the answers do not match. It is also useful to allow the output word
“ill-formed”. This output is used if the sequence of operations is
ill-formed, e.g., an operation has too many arguments or an argument refers
to an inappropriate object.

The answer-validation problem is similar to the idea of an acceptance
test which is used in the recovery-block approach [48, 6] to software fault
tolerance. The main difference is that an answer-validation problem is
dependent upon a sequence of answers, not just an individual answer. Hence,
if an incorrect answer appears in the sequence, it may not be detected
immediately. It is guaranteed, however, that an incorrect answer will be
detected at some point during the processing of the entire sequence. By
allowing for this latency in detection, it is possible to create a much more
efficient procedure for solving the answer-validation problem.

The most important aspect of the answer-validation problem is that it 
is often possible to check the correctness of the answers to a sequence of 
operations much more quickly than actually calculating what the answers 
should be from scratch. In other words, the answer-validation problem has a
smaller time complexity than the original abstract-data-type problem. This 
speedup is very useful in fault-detection applications. 

It is possible to run an answer-validation algorithm for some abstract 
data type concurrently with some algorithm which uses the abstract data 
type. The answer-validation algorithm could act as a monitor making sure
that all interactions with the abstract data type are handled correctly. This
is valuable because many algorithms spend a large fraction of their time 
operating on abstract data types. Note, the overhead of this monitor is less 
than the overhead of actually performing the data-type operations a second 
time. 

7 Answer Validation for Priority Queue 

We will first consider the priority-queue abstract data type which allows 
only three operations: insert, min and deletemin. An example of a sequence 
of such operations appears in table 14. Many different data structures can 
be used to implement priority queues including heaps [61] and balanced
search trees such as AVL trees [5], red-black trees [27], or B-trees [13]. It
is possible to process a sequence of $O(n)$ operations in $O(n \log n)$ time
using the data structures above. Furthermore, there is a lower bound of
$\Omega(n \log n)$ because it is possible to sort using a priority queue.
Remarkably, the answer-validation problem can be solved using only $O(n)$
time, as documented below.

The algorithm which we present in this section is the same as that given 
in [3]. It is necessary to include a description of this algorithm because the 
algorithm in the next section (which has not appeared before) builds on this 
algorithm. 

Each operation is time-stamped, i.e., the operations are assigned integers 
sequentially starting with 1 which is easy to do with a counter. The answer- 
validation algorithm uses a stack called answerstack. The contents of this 
stack are illustrated in table 14. The top of the stack is on the left in table 14. 

Time  Operation       Answer    Insert time  Stack used in validation
1     insert(6,300)
2     insert(2,404)
3     insert(3,250)
4     deletemin       (3,250)   3            (3,250,4)
5     insert(10,248)
6     insert(12,245)
7     insert(4,260)
8     min             (12,245)  6            (12,245,8), (3,250,4)
9     insert(13,140)
10    insert(5,142)
11    deletemin       (13,140)  9            (13,140,11), (12,245,8), (3,250,4)
12    deletemin       (5,142)   10           (5,142,12), (12,245,8), (3,250,4)
13    deletemin       (12,245)  6            (12,245,13), (3,250,4)
14    deletemin       (10,248)  5            (10,248,14), (3,250,4)
15    deletemin       (4,260)   7            (4,260,15)


Table 14: Sequence of Priority Queue operations illustrating answer
validation algorithm

Let us consider the kinds of tests that an answer-validation algorithm
for a priority queue might perform. Suppose (i,k) is the answer to some
min or deletemin operation. Further, suppose (i',k') was the answer to a
previous min or deletemin operation. If the priority queue is correct then
either (i,k) ≥ (i',k') or (i,k) was inserted after the answer (i',k') was given.
This suggests that the time of insertion for an element and the time of an
answer should be recorded, and the algorithm below does this. Unfortunately,
if an algorithm compares an ordered pair which has been given as an answer
against all previous answers then the algorithm complexity is at least
$O(m^2)$. To avoid this a stack called the answerstack is used. The
answerstack was designed to allow many comparisons to be done implicitly,
and thus the overall complexity of the many tests is reduced.

Algorithm for Answer Validation for Priority Queue 

Input: Sequence of m operations together with arguments and supposed 
answers for the priority-queue data type. 

Output: “correct”, “incorrect” or “ill-formed” 

Declarations: Array called inserttime indexed by item number. Array ele- 
ments contain either “absent” or a time-stamp. Array called keyvalue in- 
dexed by item number. Array elements contain either “absent” or a key 
value. Initially, each element in these two arrays contains “absent”. Stack 
of ordered triples called answerstack. Each ordered triple has the following 
form: first element is an item number, second element is a key value, and 
third element is a time-stamp, answerstack is initially empty. 

First phase: In this phase we process each operation as it appears serially 
using the following rules: 

Let currenttime refer to the time-stamp of the operation being processed. 

insert(i,k): If inserttime[i] ≠ “absent” then output “ill-formed” and stop.
Otherwise, let inserttime[i] := currenttime and let keyvalue[i] := k.

min (i,k): (where (i,k) is the supposed answer to the min operation.)
If inserttime[i] = “absent” or keyvalue[i] ≠ k then output “ill-formed”
and stop.

Otherwise, let (i',k') be the item number and key value of the triple on 
the top of answerstack (if there is one). Repeatedly pop the stack until 
(i,k)<(i',k') or until answerstack is empty. 

If answerstack is empty then push the triple (i,k,currenttime) onto
answerstack and process the next priority queue operation.

If answerstack is non-empty then let the top element be (i',k',answertime').
If inserttime[i] < answertime' then output “incorrect” and stop. Otherwise,
push the triple (i,k,currenttime) onto answerstack and process the next
priority queue operation.

deletemin (i,k): (where (i,k) is the supposed answer to the deletemin
operation.) Perform the same actions as those described for the min
operation. However, just before processing the next priority queue operation,
let inserttime[i] := “absent” and let keyvalue[i] := “absent”.

Second phase: In this phase we operate on the items which have been 
inserted but have never been deleted. 

Scan the array inserttime and for each item number i for which
inserttime[i] ≠ “absent” construct an ordered triple
(i,keyvalue[i],inserttime[i]). Call this set of ordered triples remainders.

Use a bucket sort to sort the triples in remainders by their time-stamps, i.e., 
the third element of the ordered triple. 

Merge the triples in remainders together with the triples in answerstack so 
that they are all ordered by their time-stamps, i.e., the third element of the 
ordered triple. 

Scan the combined triples to determine if there exist two triples which satisfy 
the following: inserttime[i]<answertime' and (i,keyvalue[i])<(i',k'); where 
one triple is from remainders and has the form (i,keyvalue[i],inserttime[i]) 
and where the other triple is from answerstack and has the form (i',k', answertime'); 

If these two triples exist then output “incorrect” and stop. Otherwise output 
“correct” and stop. 


Theorem 7.1 The algorithm for answer validation of the priority queue 
abstract data type is correct. 

Theorem 7.2 The answer validation algorithm for priority queue has a 
time complexity of $O(n)$ for processing a sequence of $O(n)$ operations.

For proofs of these theorems see [3]. 
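
The two phases described above can be sketched in Python as follows. This is an illustrative reading of the prose rather than the implementation of [3]. Three assumptions are made here: pairs (i,k) are ordered by key with ties broken by item number, time-stamps are the 1-based positions in the input list, and the second phase uses a plain nested scan instead of the bucket sort and merge, so this sketch does not attain the linear bound of Theorem 7.2. All names are hypothetical.

```python
def validate_priority_queue(ops):
    """Answer validation for insert / min / deletemin.

    ops is a list of tuples ("insert", i, k), ("min", i, k) or
    ("deletemin", i, k); for min and deletemin, (i, k) is the supposed answer.
    Returns "correct", "incorrect" or "ill-formed".
    """
    insert_time = {}    # item -> time-stamp; a missing key means "absent"
    key_value = {}      # item -> key value
    answer_stack = []   # triples (item, key, answer_time); top is the list end

    def less(i, k, j, l):
        # Assumed ordering of pairs: by key first, then by item number.
        return (k, i) < (l, j)

    # First phase: process each operation as it appears serially.
    for now, (op, i, k) in enumerate(ops, start=1):
        if op == "insert":
            if i in insert_time:
                return "ill-formed"
            insert_time[i], key_value[i] = now, k
        elif op in ("min", "deletemin"):
            if i not in insert_time or key_value[i] != k:
                return "ill-formed"
            # Pop until the answer is strictly smaller than the top pair.
            while answer_stack and not less(i, k, *answer_stack[-1][:2]):
                answer_stack.pop()
            if answer_stack and insert_time[i] < answer_stack[-1][2]:
                return "incorrect"
            answer_stack.append((i, k, now))
            if op == "deletemin":
                del insert_time[i], key_value[i]
        else:
            return "ill-formed"

    # Second phase: items inserted but never deleted.
    for i, t_ins in insert_time.items():
        for j, l, t_ans in answer_stack:
            if t_ins < t_ans and less(i, key_value[i], j, l):
                return "incorrect"
    return "correct"


if __name__ == "__main__":
    table_14 = [
        ("insert", 6, 300), ("insert", 2, 404), ("insert", 3, 250),
        ("deletemin", 3, 250), ("insert", 10, 248), ("insert", 12, 245),
        ("insert", 4, 260), ("min", 12, 245), ("insert", 13, 140),
        ("insert", 5, 142), ("deletemin", 13, 140), ("deletemin", 5, 142),
        ("deletemin", 12, 245), ("deletemin", 10, 248), ("deletemin", 4, 260),
    ]
    print(validate_priority_queue(table_14))
```

Run on the sequence of table 14, the stack passes through the same contents as the last column of that table.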




8 Answer Validation for Generalized Priority Queue 

We next consider the priority-queue abstract data type which allows four 
operations: insert, min, deletemin, and delete. An example of a sequence of 
such operations appears in table 15. 

The algorithm to solve the validation problem for this data type is an en- 
hanced version of the algorithm given above for the data type which allowed 
only three priority-queue operations. 

Algorithm for Answer Validation for Generalized Priority Queue 

Input: Sequence of m operations together with arguments and supposed 
answers for the priority-queue data type. 

Output: “correct”, “incorrect” or “ill-formed” 

Declarations: All the declarations used in the earlier algorithm are used
again. In addition, a collection of sets called stacksets is used. Each set
in stacksets consists of a set of item numbers (possibly the empty set).
There is a one-to-one correspondence between the sets in stacksets and the
ordered triples in answerstack. Initially, answerstack consists solely of the
ordered triple (0,-∞,-1). Also initially, stacksets contains exactly one set
which is the empty set and which corresponds to (0,-∞,-1).

First phase: In this phase we process each operation as it appears serially 
using the following rules: 

Let currenttime refer to the time-stamp of the operation being processed. 

insert (i,k): Perform the same actions as those given earlier for the insert 
operation. In addition, add the item number i to the set in stacksets corre- 
sponding to the top element in answerstack. 

min (i,k): (where (i,k) is the supposed answer to the min operation.)
Perform the same actions as those given earlier for the min operation.

In addition, if any elements are popped off of answerstack then the sets in 
stacksets corresponding to these elements are unioned together to form a 
new set. This new set is placed in correspondence with the new top element 
of answerstack. 

deletemin (i,k): (where (i,k) is the supposed answer to the deletemin
operation.) Perform the same actions as those given for the min operation
described immediately above. In addition, remove the item number i from
the set in stacksets which contains it. Further, before processing the next
priority queue operation, let inserttime[i] := “absent” and let
keyvalue[i] := “absent”.


In the last column, the first line of each row is answerstack and the second
line lists the corresponding sets in stacksets.

Time  Operation       Answer    Insert time  Stack used in validation
1     insert(5,310)                          (0,-∞,-1)
                                             {5}
2     insert(6,210)                          (0,-∞,-1)
                                             {5,6}
3     insert(8,280)                          (0,-∞,-1)
                                             {5,6,8}
4     min             (6,210)   2            (6,210,4)
                                             {5,6,8}
5     insert(9,190)                          (6,210,4)
                                             {5,6,8,9}
6     min             (9,190)   5            (9,190,6), (6,210,4)
                                             {}, {5,6,8,9}
7     insert(2,275)                          (9,190,6), (6,210,4)
                                             {2}, {5,6,8,9}
8     delete(8)                 3            (9,190,6), (6,210,4)
                                             {2}, {5,6,9}
9     insert(12,170)                         (9,190,6), (6,210,4)
                                             {2,12}, {5,6,9}
10    insert(14,400)                         (9,190,6), (6,210,4)
                                             {2,12,14}, {5,6,9}
11    deletemin       (12,170)  9            (12,170,11), (9,190,6), (6,210,4)
                                             {}, {2,14}, {5,6,9}
12    insert(3,290)                          (12,170,11), (9,190,6), (6,210,4)
                                             {3}, {2,14}, {5,6,9}
13    insert(7,330)                          (12,170,11), (9,190,6), (6,210,4)
                                             {3,7}, {2,14}, {5,6,9}
14    insert(15,200)                         (12,170,11), (9,190,6), (6,210,4)
                                             {3,7,15}, {2,14}, {5,6,9}
15    delete(9)                 5            (12,170,11), (9,190,6), (6,210,4)
                                             {3,7,15}, {2,14}, {5,6}
16    deletemin       (15,200)  14           (15,200,16), (6,210,4)
                                             {2,3,7,14}, {5,6}
17    delete(7)                 13           (15,200,16), (6,210,4)
                                             {2,3,14}, {5,6}
18    deletemin       (6,210)   2            (6,210,18)
                                             {2,3,5,14}
19    delete(14)                10           (6,210,18)
                                             {2,3,5}


Table 15: Sequence of Priority Queue operations illustrating answer
validation algorithm

delete(i): If inserttime[i] = “absent” or keyvalue[i] = “absent” then output 
“ill-formed” and stop. 

Otherwise, let inserttime=inserttime[i] and let k=keyvalue[i]. Next, let 
inserttime[i] = “absent” and let keyvalue[i] = “absent”. 

Now, let (i',k',answertime') be the ordered triple which corresponds to
the set in stacksets containing item number i. Next, remove item number i
from the set which contains it.

If answertime' > inserttime and (i,k) < (i',k') then output “incorrect” and
stop.

If answertime' > inserttime and (i,k) > (i',k') then process the next priority
queue operation.

If (i',k',answertime') is the top element of answerstack then process the
next priority queue operation.

Let (i'',k'',answertime'') be the element immediately above
(i',k',answertime') on answerstack.

If (i,k) < (i'',k'') then output “incorrect” and stop. Otherwise, process the
next priority queue operation.

Second phase: In this phase we operate on the items which have been 
inserted but have never been deleted. 

For this phase one performs the same operations as the second phase de- 
scribed earlier. 


Theorem 8.1 The algorithm above for answer validation of the priority
queue abstract data type is correct.

Theorem 8.2 The answer validation algorithm above for priority queue has
a time complexity of $O(n)$ for processing a sequence of $O(n)$ operations.

Proofs omitted for space reasons. It is clear that a priority queue with 
operations insert, delete, max, deletemax can also be validated in linear time 
by changing the appropriate signs in the algorithm above. 
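
A compact Python sketch of the generalized algorithm follows; it is an illustration only and makes several assumptions. Pairs are ordered by key and then item number. Each stack entry carries its own Python set, and merging sets by plain union does not give the linear bound of Theorem 8.2. The delete test is written so that a deleted item is rejected when it is smaller than an answer that was given while the item was present, which is the reading under which the sequence of table 15 validates. All names are hypothetical.

```python
import math


def validate_generalized(ops):
    """Answer validation for insert / min / deletemin / delete.

    ops holds ("insert", i, k), ("min", i, k), ("deletemin", i, k) or
    ("delete", i). Returns "correct", "incorrect" or "ill-formed".
    """
    insert_time, key_value = {}, {}
    # A stack entry stores the answer as (key, item), the answer time, its
    # stackset ("members") and its fixed position on the stack ("pos").
    stack = [{"pair": (-math.inf, 0), "time": -1, "members": set(), "pos": 0}]
    owner = {}  # item -> stack entry whose set currently contains the item

    for now, (op, i, *rest) in enumerate(ops, start=1):
        if op == "insert":
            if i in insert_time:
                return "ill-formed"
            insert_time[i], key_value[i] = now, rest[0]
            stack[-1]["members"].add(i)
            owner[i] = stack[-1]
        elif op in ("min", "deletemin"):
            k = rest[0]
            if i not in insert_time or key_value[i] != k:
                return "ill-formed"
            merged = set()
            while stack and not (k, i) < stack[-1]["pair"]:
                merged |= stack.pop()["members"]   # union the popped sets
            if stack and insert_time[i] < stack[-1]["time"]:
                return "incorrect"
            entry = {"pair": (k, i), "time": now,
                     "members": merged, "pos": len(stack)}
            stack.append(entry)
            for m in merged:
                owner[m] = entry
            if op == "deletemin":
                owner.pop(i)["members"].discard(i)
                del insert_time[i], key_value[i]
        elif op == "delete":
            if i not in insert_time:
                return "ill-formed"
            t_ins, pair = insert_time.pop(i), (key_value.pop(i), i)
            entry = owner.pop(i)
            entry["members"].discard(i)
            if entry["time"] > t_ins:
                # The item was present when this answer was given.
                if pair < entry["pair"]:
                    return "incorrect"
            elif entry["pos"] + 1 < len(stack):
                # Otherwise compare with the answer immediately above.
                if pair < stack[entry["pos"] + 1]["pair"]:
                    return "incorrect"
        else:
            return "ill-formed"

    # Second phase: items inserted but never deleted (simple quadratic scan).
    for i, t_ins in insert_time.items():
        if any(t_ins < e["time"] and (key_value[i], i) < e["pair"] for e in stack):
            return "incorrect"
    return "correct"


if __name__ == "__main__":
    print(validate_generalized([("insert", 1, 10), ("insert", 2, 5),
                                ("min", 2, 5), ("delete", 1)]))
```

A production version would replace the per-entry sets with a structure supporting fast union and membership lookup; the choice of structure is left open here.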

Definition 8.3 Consider a sequence of priority queue operations together
with arguments and supposed answers. The sequence may contain the
following operations: insert, delete, min, deletemin, max, and deletemax.
Based on this sequence we define a new sequence called a minimum sequence.
This sequence differs from the original sequence as follows: Each max
operation and answer pair is removed from the sequence. Each deletemax
operation and answer pair is replaced by a delete(i) operation where i is the
item number given in the answer to the deletemax operation. Each other
operation remains the same.

We also define a maximum sequence. This sequence differs from the 
original sequence as follows: Each min operation and answer pair is removed 
from the sequence. Each deletemin operation and answer pair is replaced 
by a delete(i) operation where i is the item number given in the answer to 
the deletemin operation. Each other operation remains the same. 

Theorem 8.4 Consider a sequence of priority queue operations together 
with arguments and supposed answers. The sequence may contain the fol- 
lowing operations: insert, delete, min, deletemin, max, and deletemax. The 
answers given for this sequence are correct if and only if the answers given 
for the corresponding minimum and maximum sequences are both correct. 

This theorem allows us to define an algorithm which solves the
answer-validation problem for a general priority queue.
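
The construction in Definition 8.3 is mechanical, as the following sketch shows. The tuple encoding of operations and the function names are assumptions made for illustration; the two resulting sequences would then be passed to a min-oriented validator and its max-oriented mirror, as Theorem 8.4 suggests.

```python
def _project(ops, drop, replace):
    """Remove every `drop` operation and turn every `replace` into a delete."""
    out = []
    for op in ops:
        name = op[0]
        if name == drop:
            continue                       # the operation and its answer go
        if name == replace:
            out.append(("delete", op[1]))  # keep only the item number
        else:
            out.append(op)
    return out


def minimum_sequence(ops):
    """Drop max operations; deletemax(i, k) becomes delete(i)."""
    return _project(ops, drop="max", replace="deletemax")


def maximum_sequence(ops):
    """Drop min operations; deletemin(i, k) becomes delete(i)."""
    return _project(ops, drop="min", replace="deletemin")


if __name__ == "__main__":
    sequence = [("insert", 1, 10), ("insert", 2, 20), ("max", 2, 20),
                ("deletemin", 1, 10), ("deletemax", 2, 20)]
    print(minimum_sequence(sequence))
    print(maximum_sequence(sequence))
```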

9 Probabilistic Model 

We will now present a simple probabilistic model with accompanying analysis
which will permit a comparison between our certification-trail method
and the classical time-redundancy approach [32, 52]. The analysis shows
that when the certification-trail method has a smaller execution time than
the time-redundancy approach it yields strictly superior performance. This
means the certification-trail method has both a smaller probability of error
and a smaller probability of undetected error. Surprisingly, the analysis
also reveals that the certification-trail method often can display superior
performance even when the method has the same execution time or a longer
execution time than the time-redundancy approach. This superior behavior
stems from the typical asymmetry of the execution times of the first and
second executions in the certification-trail method.

We make the following assumptions. 

i. Errors are distributed exponentially with parameter $\lambda$.

ii. If errors occur during only one phase of the execution, then they are
detected.

iii. If errors occur in both phases of an execution they are not detected.

For solutions to a problem with run times $a$ and $b$, we therefore have:

$$\Pr\{correct\} = e^{-\lambda(a+b)}$$

$$\Pr\{detected\} = e^{-\lambda a}(1 - e^{-\lambda b}) + e^{-\lambda b}(1 - e^{-\lambda a})$$

$$\Pr\{undetected\} = 1 - \Pr\{correct\} - \Pr\{detected\}$$
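
As a numerical check of the three single-run probabilities, the sketch below evaluates them for phase run times a and b, with each phase error free with probability e^{-λ·time}. The function and variable names are illustrative assumptions. With a = b = 1 the values agree with the single-run columns of Table 16.

```python
import math


def single_run(lam, a, b):
    """Return (Pr{correct}, Pr{detected}, Pr{undetected}) for one execution."""
    ok_a = math.exp(-lam * a)   # first phase is error free
    ok_b = math.exp(-lam * b)   # second phase is error free
    correct = ok_a * ok_b
    detected = ok_a * (1 - ok_b) + ok_b * (1 - ok_a)  # exactly one phase fails
    undetected = (1 - ok_a) * (1 - ok_b)              # both phases fail
    return correct, detected, undetected


if __name__ == "__main__":
    for lam in (0.01, 0.10, 1.00):
        print(lam, ["%.6f" % p for p in single_run(lam, 1.0, 1.0)])
```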


Given two solutions for a problem, we say that the first is strictly superior
to the second iff:

$$\Pr_1\{correct\} > \Pr_2\{correct\} \text{ and } \Pr_1\{undetected\} \le \Pr_2\{undetected\}$$

or

$$\Pr_1\{correct\} \ge \Pr_2\{correct\} \text{ and } \Pr_1\{undetected\} < \Pr_2\{undetected\}$$

This implies that the run time of the first solution is no greater than 
that of the second solution. 

Observation 1 Suppose there are two solutions (using certification trails)
to a problem, such that each solution runs in two phases, and the combined
run time of the phases is the same for both solutions. Then the solution with
the greater time imbalance between phases is strictly superior.

Proof: Let $2a$ be the total run time. Let $a + b$ be the run time of the
first phase of the first solution, and $a + c$ be the run time of the first
phase of the second solution. Then the second phases have run times of
$a - b$ and $a - c$ respectively. Assume $b < c$.

Since the total run time is the same for both solutions, we have
$\Pr_1\{correct\} = \Pr_2\{correct\} = e^{-2\lambda a}$, so we need only show
that $\Pr_1\{detected\} < \Pr_2\{detected\}$, i.e.,

$$e^{-\lambda(a+b)} + e^{-\lambda(a-b)} < e^{-\lambda(a+c)} + e^{-\lambda(a-c)}$$

$$e^{-\lambda b} + e^{\lambda b} < e^{-\lambda c} + e^{\lambda c}$$

Setting $x = e^{\lambda b}$ and $y = e^{\lambda c}$, we want

$$x + \frac{1}{x} < y + \frac{1}{y} \quad \text{for } 1 \le x < y$$

$$\frac{1}{x} - \frac{1}{y} < y - x$$

$$\frac{y - x}{xy} < y - x$$

which holds because $xy > 1$.

Corollary 1 Given a basic algorithm for a problem, a certification trail 
method is superior to running the basic algorithm twice if the total run time 
is no greater than twice that of the basic algorithm. 

The above statements apply to the situation of a single execution of a
solution. A more interesting case is to iterate the solution until no errors are
reported; that is, we either arrive at the correct answer or have undetected
errors.

Let $\Pr_{iter}\{correct\}$ denote the probability of finding a correct solution
in the iterated scheme and $\Pr_{iter}\{undetected\}$ denote the probability of
accepting an incorrect run.

Note that we repeat a run only when errors are detected, so if we obtain
the correct answer on the $n$-th run, the previous $n - 1$ runs must have
resulted in detected errors. Thus it is clear that:

$$\Pr_{iter}\{correct\} = \Pr\{correct\} \sum_{i=0}^{\infty} \Pr\{detected\}^i = \frac{\Pr\{correct\}}{1 - \Pr\{detected\}}$$

Similarly,

$$\Pr_{iter}\{undetected\} = \frac{\Pr\{undetected\}}{1 - \Pr\{detected\}}$$

For the iterated scheme, we will say that one method is superior to
another if the probability of obtaining the correct answer is larger. Obviously
if a method is superior in the single run sense, it must be superior in the
iterated case. However it is possible for one method to be superior to another
in the iterated scheme, but not in the single run scheme. This means that
a certification trail method may be better than running a basic algorithm
twice, even if the certification trail takes longer to run!

Suppose we have a basic algorithm A with running time $a$ for a particular
problem, and a certification trail method with phases running in times $b$ and
$c$. Given $b$, how small must $c$ be for the certification trail to be superior?

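We require:

$$\frac{\Pr_{cert}\{correct\}}{1 - \Pr_{cert}\{detected\}} > \frac{\Pr_{basic}\{correct\}}{1 - \Pr_{basic}\{detected\}}$$

$$\frac{e^{-\lambda(b+c)}}{1 - e^{-\lambda b} - e^{-\lambda c} + 2e^{-\lambda(b+c)}} > \frac{e^{-2\lambda a}}{1 - 2e^{-\lambda a} + 2e^{-2\lambda a}}$$

$$e^{-\lambda c}\left(e^{-\lambda b} + e^{-2\lambda a} - 2e^{-\lambda(a+b)}\right) > e^{-2\lambda a} - e^{-\lambda(2a+b)}$$

Note that $b > a$, so $e^{-\lambda b} + e^{-2\lambda a} - 2e^{-\lambda(a+b)}$
must be positive. So,

$$e^{-\lambda c} > \frac{e^{-2\lambda a}(1 - e^{-\lambda b})}{e^{-\lambda b} + e^{-2\lambda a} - 2e^{-\lambda(a+b)}}$$

$$c < -\frac{1}{\lambda} \ln\left(\frac{e^{-2\lambda a}(1 - e^{-\lambda b})}{e^{-\lambda b} + e^{-2\lambda a} - 2e^{-\lambda(a+b)}}\right)$$

Since the argument to $\ln$ is strictly between 0 and 1, $c$ is well defined for
any choice of $a$, $b$, and $\lambda$.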
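
The bound on c can be evaluated directly. The sketch below uses the closed form obtained by equating the iterated probabilities of correctness of the two schemes under the model's assumptions; the function name and the default a = 1 are illustrative choices. For a = 1 it gives values matching the breakeven column of Table 17 to about five decimal places.

```python
import math


def breakeven_checker_time(lam, b, a=1.0):
    """Largest second-phase time c at which the trail method still matches
    running the basic algorithm (run time a) twice, in the iterated sense."""
    x = math.exp(-lam * b)           # trail-generating phase is error free
    z2 = math.exp(-2 * lam * a)      # both basic runs are error free
    ratio = z2 * (1 - x) / (x + z2 - 2 * math.exp(-lam * (a + b)))
    return -math.log(ratio) / lam


if __name__ == "__main__":
    for lam in (0.01, 0.10, 1.00):
        for b in (1.10, 1.50, 2.00):
            print("%.2f  %.2f  %.6f" % (lam, b, breakeven_checker_time(lam, b)))
```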

In addition to the probability of correctness, we would like to know the 
expected running time using the iterated approach. Fortunately, this is 
easily determined. 

Our probability of stopping on a particular execution is $\Pr\{correct\} +
\Pr\{undetected\} = 1 - \Pr\{detected\}$. Therefore with that probability we
stop on the first execution, with probability
$\Pr\{detected\}(1 - \Pr\{detected\})$ we stop on the second execution, and in
general we stop on the $n$th execution with probability
$(1 - \Pr\{detected\}) \Pr\{detected\}^{n-1}$. This gives us an expected number
of iterations of

$$(1 - \Pr\{detected\}) \sum_{i=0}^{\infty} (i + 1) \Pr\{detected\}^i$$

Now,

$$\sum_{i=0}^{\infty} (i + 1) x^i = \frac{1}{(1 - x)^2} \quad \text{for } |x| < 1,$$

so we find that the expected number of iterations is

$$\frac{1}{1 - \Pr\{detected\}}$$

Multiplying by the run time of a single iteration gives the expected
running time.
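
The iterated quantities follow from the single-run probabilities in a few lines. The sketch below is illustrative only; its names are assumptions, and the single-run probabilities are recomputed inline so that the block stands alone. With a = b = 1 it yields the last two columns of Table 16.

```python
import math


def iterated(lam, a, b):
    """Return (iterated Pr{correct}, expected run time) for phase times a, b."""
    ok_a, ok_b = math.exp(-lam * a), math.exp(-lam * b)
    correct = ok_a * ok_b
    detected = ok_a * (1 - ok_b) + ok_b * (1 - ok_a)
    iter_correct = correct / (1 - detected)
    expected_iterations = 1 / (1 - detected)
    return iter_correct, (a + b) * expected_iterations


if __name__ == "__main__":
    print(iterated(1.0, 1.0, 1.0))    # basic algorithm run twice
    print(iterated(1.0, 1.5, 0.25))   # an unbalanced certification-trail run
```

For λ = 1 the unbalanced pair (1.50, 0.25) has a shorter single execution than two basic runs, but its higher detection probability leads to more repetitions and a longer expected run time.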

Table 16 shows information for running a basic algorithm. The run time
of a basic algorithm is set to 1 unit of time. The basic algorithm is run
twice and the results compared. We assume that the comparator is fast enough
that the time it takes is negligible (this is justified by the experimental
results), and that it is error free. We compute:

i. Prob. Correct - The probability that both phases are error free.

ii. Prob. Detected - The probability that exactly one of the phases contains
an error.

iii. Prob. Undetected - The probability that both of the phases contain
errors.

iv. Iterated Prob. Correct - If the basic algorithm is iterated (each
iteration is two runs), this is the probability that the terminating result is
correct.

v. Expected Runtime - The expected run time of the algorithm in the
iterated model. For the basic algorithm this is twice the expected
number of iterations.

λ     Basic      Prob.     Prob.     Prob.       Iter. Prob.  Expected
      Algorithm  Correct   Detected  Undetected  Correct      Runtime
0.01  1          0.980199  0.019702  0.000099    0.999899     2.040197
0.10  1          0.818731  0.172213  0.009056    0.989060     2.416081
1.00  1          0.135335  0.465088  0.399576    0.253005     3.738935


Table 16: Balanced Probabilities


λ     Generate Trail  Breakeven Trail Checker
0.01  1.10            0.909050
0.01  1.50            0.666111
0.01  2.00            0.498750
0.10  1.10            0.908683
0.10  1.50            0.661128
0.10  2.00            0.487505
1.00  1.10            0.905504
1.00  1.50            0.614107
1.00  2.00            0.379885


Table 17: Certification checker breakeven points


Table 17 illustrates the “breakeven” point for the certification trail
approach. Given a value for $\lambda$ and a run time $b$ of a trail generating
algorithm, the breakeven point for the run time of the trail checking
algorithm is the point at which the iterated probability of correctness is the
same as for the “basic” algorithm (which has a run time of 1).

Run times less than this will result in the certification trail solution being 
superior. It is interesting to notice that in the total length of the solution at 
the breakeven point is greater than 2, ie. running the basic algorithm twice. 

Table 18 is similar to Table 16, the difference being that it examines
the behavior of certification trail methods for different run times of the two
phases. The meaning of the other columns is identical to the meaning in the
table for basic algorithms. Of interest is the row A = 1.00, b = 1.50, c = 0.25.
Compare this with Table 16 for A = 1.00. We see that the certification
method has a greater probability of being correct for a single run and the
total run time is shorter than twice the basic algorithm, yet the expected
iterated run time is larger!
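
The comparison above can be checked numerically. The following sketch recomputes the table columns for one basic row and one certification trail row. The formulas are an assumption inferred from the tabulated values rather than quoted from the report: each phase of run time t is fault-free with probability exp(-A t), a fault in exactly one phase is detected, and faults in both phases go undetected. The function and field names are hypothetical.

```python
import math
from collections import namedtuple

Outcome = namedtuple(
    "Outcome", "correct detected undetected iterated_correct expected_runtime"
)


def outcome_model(lam, b, c):
    """Table columns for a two-phase scheme with run times b and c.

    Assumed model: fault-free probability exp(-lam * t) per phase;
    the basic algorithm corresponds to b = c = 1.
    """
    ok_b = math.exp(-lam * b)
    ok_c = math.exp(-lam * c)
    correct = ok_b * ok_c                      # both phases fault-free
    undetected = (1 - ok_b) * (1 - ok_c)       # both phases faulty
    detected = 1 - correct - undetected        # exactly one phase faulty
    # Iterating until no error is detected: condition on termination.
    iterated_correct = correct / (correct + undetected)
    expected_runtime = (b + c) / (1 - detected)
    return Outcome(correct, detected, undetected,
                   iterated_correct, expected_runtime)


basic = outcome_model(1.00, 1.00, 1.00)
trail = outcome_model(1.00, 1.50, 0.25)
print("basic:", [round(v, 6) for v in basic])
print("trail:", [round(v, 6) for v in trail])
```

With A = 1.00 the trail row has the higher single-run probability of correctness and the shorter total run time (1.75 versus 2), but its larger detection probability forces more iterations, which is why its expected iterated run time comes out larger.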


A     Generate  Use      Prob.     Prob.     Prob.       Iter. Prob.  Expected
      Certif.   Certif.  Correct   Detected  Undetected  Correct      Runtime

0.01  1.10      0.25     0.986591  0.013382  0.000027    0.999972     1.368311
0.01  1.10      0.50     0.984127  0.015818  0.000055    0.999945     1.625716
0.01  1.10      0.75     0.981670  0.018248  0.000082    0.999917     1.884387
0.01  1.50      0.25     0.982652  0.017311  0.000037    0.999962     1.780827
0.01  1.50      0.50     0.980199  0.019727  0.000074    0.999924     2.040248
0.01  1.50      0.75     0.977751  0.022138  0.000111    0.999886     2.300937
0.01  2.00      0.25     0.977751  0.022199  0.000049    0.999949     2.301082
0.01  2.00      0.50     0.975310  0.024591  0.000099    0.999899     2.563028
0.01  2.00      0.75     0.972875  0.026977  0.000148    0.999848     2.826245
0.10  1.10      0.25     0.873716  0.123712  0.002572    0.997065     1.540590
0.10  1.10      0.50     0.852144  0.142776  0.005080    0.994074     1.866490
0.10  1.10      0.75     0.831104  0.161369  0.007527    0.991025     2.205976
0.10  1.50      0.25     0.839457  0.157104  0.003439    0.995920     2.076175
0.10  1.50      0.50     0.818731  0.174476  0.006793    0.991771     2.422703
0.10  1.50      0.75     0.798516  0.191419  0.010065    0.987553     2.782653
0.10  2.00      0.25     0.798516  0.197008  0.004476    0.994426     2.802021
0.10  2.00      0.50     0.778801  0.212359  0.008841    0.988776     3.174033
0.10  2.00      0.75     0.759572  0.227330  0.013098    0.983049     3.559087
1.00  1.10      0.25     0.259240  0.593191  0.147568    0.637254     3.318513
1.00  1.10      0.50     0.201897  0.535609  0.262495    0.434755     3.445370
1.00  1.10      0.75     0.157237  0.490763  0.352000    0.308770     3.632888
1.00  1.50      0.25     0.173774  0.654383  0.171843    0.502793     5.063409
1.00  1.50      0.50     0.135335  0.558990  0.305674    0.306876     4.535047
1.00  1.50      0.75     0.105399  0.484698  0.409903    0.204539     4.366374
1.00  2.00      0.25     0.105399  0.703338  0.191263    0.355283     7.584379
1.00  2.00      0.50     0.082085  0.577696  0.340219    0.194374     5.919905
1.00  2.00      0.75     0.063928  0.479846  0.456226    0.122902     5.286897


Table 18: Unbalanced Probabilities


10 Fault Injection Experiments

A series of hardware fault injection experiments have been conducted during
which combinations of the address, data, and control lines of a Motorola
M68000-based target system were pulsed with selected signals of various
types and durations while the target was executing algorithms. In addition
to the MC68000 microprocessor which served as the CPU, the target also
comprised 512K bytes of RAM, 512 bytes of ROM, and numerous I/O
modules to support serial and parallel communication. A timer module is
also included in the target; it uses the 4 MHz clock as a reference to
provide execution time data for experiments. Finally, a simple operating
system is resident in the ROM of the target which provides programming
and operational support.

The fault injection testbed on which these experiments were performed is
shown in Figure 3. In addition to the target system, the fault injection
testbed contains other modules which perform the fault injection and data
acquisition functions under instruction from the Operations Control
Console. By means of RS232C, SCSI, and GPIB interfaces, a Macintosh IIcx
serves as the Operations Control Console, permitting fault injections to be
precisely executed and resulting error data to be recorded for later
analysis by a SUN SPARCstation 2.

The Operations Control Console also communicates over a VMEbus with 
the Testbed Controller which is responsible for overall testbed operation. 
The primary component of the Testbed Controller is a MC68030-based unit 
with 8 Mbytes of SRAM to store error data from fault injection runs as 
communicated to it over the VMEbus from the data acquisition module. 
The Testbed Controller also is similarly responsible for the operations of 
the fault injection module as determined by commands from the Operations 
Control Console.

The fault injection module and the data acquisition module have access
via edge connector pins to the lines of the target system selected for injection
and monitoring, respectively. The fault injections are precisely triggered
after some operator determined delay following the appearance of an operator
pre-selected set of bits on either the address lines of the address bus or the
data lines of the data bus. Similarly, the durations and frequencies of the
injections are also controlled by the operator. The injections emanate from
a bank of programmable function generators included in the fault injection
module. The precision with which fault conditions are triggered and injected
permits the resulting error conditions which are observed to be repeated (if
necessary) for further monitoring/analysis. The data acquisition module is
also triggered by the same address or data bits that activated the fault
injection module. However, there is no delay associated with the data
acquisition function; transfer of the signals on the lines being monitored by
the data acquisition module to the memory of the Testbed Controller commences
immediately upon the data acquisition module’s activation. Data monitored by
the data acquisition module is transmitted directly onto the VMEbus and then
written into the SRAM of the Testbed Controller.

10.1 Fault injection and error classification in MC68000 target system

To generally indicate the details of the fault injection experiments using the 
target system, the injections and resulting errors can be summarized and 
displayed at the Operations Control Console as illustrated in Figure 4. 

In the example illustrated in Figure 4, the trigger address for the injection
was selected by the operator to be address 1019F (hexadecimal) in the first
version of the Huffman tree program, which was to generate both the output
and the certification trail. The actual injection consisted of holding the
lower 4 bits of the data bus at logical zero starting 2 microseconds after 
the recognition of the trigger address by the fault injection module and 
then maintaining the logical zero on these lines for various durations lasting 
between 1 and 10 microseconds. For this example, we see that 5 distinct 
error conditions resulted depending on the duration of the injection. The 
details of data errors classified as type 2 and type 3 are beyond the scope of 
this discussion. Suffice it to say that each such type of data error observed 
in this particular experimental run could be interpreted as an inconsistent 
labeling of nodes in the certification trail passed to the second program. In 
each case, however, it should be emphasized that the execution of the second 
program utilizing the certification trail detected the error. The other errors 
listed in Figure 4 can be categorized as address errors and illegal instructions. 

Our purpose in presenting Figure 4 is only to illustrate an example of 
a fault injection run with a subsequent error analysis and classification. In 
general, the errors resulting from injections into the target system could be 
classified as: 

• No error

• Data output errors

• Certification trail errors

• Addressing errors

• Data value errors

• Halt generated

• Reset generated

• Non-termination of program

• Program mutilation


[Figure 3: block diagram of the testbed; "Testbed Controller" is the only
label recoverable from the original.]

Figure 3: Hardware fault injection testbed for MC68000-based target system


[Figure 4: console listing with the columns Fault, Delay, Width, and Error.
For a delay of 0 us and injection widths from 0.1 us to 10 us, the listed
outcomes run from "no error" through "ADDR TRAP ERROR", "data_error.2", and
"data_error.3" (each data error followed by "Certification Error:
Inconsistent Labels") to "ILLEGAL instruction".]

Figure 4: Example of output displayed at Operations Control Console for
fault injection run for Huffman tree algorithm program


Currently, the testbed tools are being expanded to produce automated 
injections using suites of fault conditions on the target system. 

Software fault injection experiments were also performed in which
instructions, data, and stack contents were modified using both the Sun
SPARCstation and the 386 machine with which the previously detailed timing
data was collected. The details of these fault injection experiments will be
presented in a companion document.

11 Concluding Discussion 

This paper experimentally supplements two previous FTCS papers [1, 3]
which theoretically explore the new fault tolerance technique referred to as
the certification trail method. We have presented experimental timing data
which illustrates the advantages of the certification trail technique over
classical time redundancy. We have also presented analytical results which
further support the significance of the certification trail technique.


References 

[1] Sullivan, G.F., and Masson, G.M., “Using certification trails to achieve
software fault tolerance,” Digest of the 1990 Fault Tolerant Computing
Symposium, pp. 423-431, IEEE Computer Society Press, 1990.

[2] Sullivan, G.F., and Masson, G.M., “Using certification trails to achieve
software fault tolerance,” Department of Computer Science Technical
Report JHU 89/26, Johns Hopkins University, Baltimore, Maryland,
1989.

[3] Sullivan, G.F., and Masson, G.M., “Certification trails for data
structures,” Digest of the 1991 Fault Tolerant Computing Symposium, pp.
240-247, IEEE Computer Society Press, 1991.

[4] Sullivan, G.F., and Masson, G.M., “Certification trails for data
structures,” Department of Computer Science Technical Report JHU 90/17,
Johns Hopkins University, Baltimore, Maryland, 1990.

[5] Adel’son-Vel’skii, G. M., and Landis, E. M., “An algorithm for the
organization of information”, Soviet Math. Dokl., pp. 1259-1262, 3, 1962.

[6] Anderson, T., and Lee, P., Fault tolerance: principles and practices,
Prentice-Hall, Englewood Cliffs, NJ, 1981.

[7] Andrews, D., “Software fault tolerance through executable assertions,”
Rec. 12th Asilomar Conf. Circuits, Syst., Comput., pp. 641-645, 1978,
Nov. 6-8.

[8] Andrews, D., “Using executable assertions for testing and fault
tolerance,” Dig. 9th Annu. Int. Symp. Fault Tolerant Comput., pp. 102-105,
1979, June 20-22.

[9] Avizienis, A., “Fault tolerance by means of external monitoring of
computer systems,” Proceedings of the 1981 National Computer Conference,
pp. 27-40, AFIPS Press, 1980.

[10] Avizienis, A., “Design diversity - the challenge of the eighties,” Digest
of the 1982 Fault Tolerant Computing Symposium, pp. 44-45, IEEE
Computer Society Press, 1982.

[11] Avizienis, A., and Kelly, J., “Fault tolerance by design diversity:
concepts and experiments,” Computer, vol. 17, pp. 67-80, Aug., 1984.

[12] Avizienis, A., “The N-version approach to fault tolerant software,”
IEEE Trans. on Software Engineering, vol. 11, pp. 1491-1501, Dec.,
1985.

[13] Bayer, R., and McCreight, E., “Organization of large ordered indexes”,
Acta Inform., pp. 173-189, 1, 1972.

[14] Blough, D., and Masson, G., “Performance analysis of a generalized
concurrent error detection procedure,” IEEE Trans. on Computers, vol.
39, Jan., 1990.

[15] Blum, M., and Kannan, S., “Designing programs that check their
work”, Proceedings of the 1989 ACM Symposium on Theory of
Computing, pp. 86-97, ACM Press, 1989.


[16] Chen, L., and Avizienis, A., “N-version programming: a fault tolerant
approach to reliability of software operation,” Digest of the 1978
Fault Tolerant Computing Symposium, pp. 3-9, IEEE Computer Society
Press, 1978.

[17] Cormen, T. H., Leiserson, C. E., and Rivest, R. L., Introduction to
Algorithms, McGraw-Hill, New York, NY, 1990.

[18] Dijkstra, E. W., “A note on two problems in connexion with graphs,”
Numer. Math., pp. 269-271, Sept., 1959.

[19] Eifert, J.B., and Shen, J.P., “Processor monitoring using asynchronous
signatured instruction streams,” Dig. 14th Int. Conf. Fault-Tolerant
Comput., pp. 394-399, 1984, June 20-22.

[20] Fredman, M. L., and Willard, D. E., “Trans-dichotomous algorithms for
minimum spanning trees and shortest paths,” Proc. 31st IEEE
Foundations of Computer Science, pp. 719-725, 1990.

[21] Fredman, M. L., and Saks, M. E., “The cell probe complexity of
dynamic data structures,” Proc. 21st ACM Symp. on Theo. Comp., 1989,
pp. 109-122, 2, 1986.

[22] Gabow, H. N., Galil, Z., Spencer, T., and Tarjan, R. E., “Efficient
algorithms for finding minimum spanning trees in undirected and directed
graphs,” Combinatorica, pp. 109-122, 2, 1986.

[23] Gabow, H. N., and Tarjan, R. E., “A linear-time algorithm for a special
case of disjoint set union,” J. of Comp. and Sys. Sci., 30(2), pp. 209-221,
1985.

[24] Graham, R. L., “An efficient algorithm for determining the convex hull
of a planar set”, Information Processing Letters, pp. 132-133, 1, 1972.

[25] Graham, R. L., and Hell, P., “On the history of the minimum spanning
tree problem,” Ann. Hist. Comput., pp. 43-47, Jan., 1985.

[26] Hoare, C. A. R., “Quicksort,” Computer Journal, pp. 10-15, 5(1), 1962.

[27] Guibas, L. J., and Sedgewick, R., “A dichromatic framework for
balanced trees”, Proceedings of the Nineteenth Annual Symposium on
Foundations of Computing, pp. 8-21, IEEE Computer Society Press,
1978.


[28] Gunneflo, U., Karlsson, J., and Torin, J., “Evaluation of error detection
schemes for using fault injection by heavy-ion radiation,” Dig. of the
1989 Fault Tolerant Computing Symposium, pp. 340-347, June, 1989.

[29] Huang, K.-H., and Abraham, J., “Algorithm-based fault tolerance for
matrix operations,” IEEE Trans. on Computers, pp. 518-529, vol. C-33,
June, 1984.

[30] Huffman, D., “A method for the construction of minimum redundancy
codes”, Proc. IRE, pp. 1098-1101, 40, 1952.

[31] Iyengar, V.S., and Kinney, L.L., “Concurrent fault detection in
microprogrammed control units,” IEEE Trans. Comput., vol. C-34, pp.
810-821, Sept., 1985.

[32] Johnson, B., Design and analysis of fault tolerant digital systems,
Addison-Wesley, Reading, MA, 1989.

[33] “Fault tolerant FFT networks,” Dig. of the 1985 Fault Tolerant
Computing Symposium, June, 1985.

[34] Kane, J.R., and Yau, S.S., “Concurrent software fault detection,” IEEE
Trans. Software Eng., vol. SE-1, pp. 87-99, March 1975.

[35] Komlos, J., “Linear verification for spanning trees”, Proceedings of the
1984 Symposium on Foundations of Computing, pp. 201-206, IEEE
Computer Society Press, 1984.

[36] Lee, Y.H., and Shin, K.G., “Design and evaluation of a fault-tolerant
multiprocessor using hardware recovery blocks,” IEEE Trans. Comput.,
vol. C-33, pp. 113-124, Feb. 1984.

[37] Lu, D., “Watchdog processor and structural integrity checking,” IEEE
Trans. Comput., vol. C-31, pp. 681-685, July 1982.

[38] Mahmood, A., Lu, D.J., and McCluskey, E.J., “Concurrent fault
detection using a watchdog processor and assertions,” Proc. 1983 Int. Test
Conf., pp. 622-628, Oct., 1983.

[39] Mahmood, A., Ersoz, A., and McCluskey, E.J., “Concurrent system level
error detection using a watchdog processor,” Proc. 1985 Int. Test Conf.,
pp. 145-152, Nov., 1985.


[40] Mahmood, A., and McCluskey, E., “Concurrent error detection using
watchdog processors - a survey,” IEEE Trans. on Computers, vol. 37,
pp. 160-174, Feb., 1988.

[41] Mahmood, A., and McCluskey, E., “Concurrent error detection using
watchdog processors”, IEEE Trans. on Computers, vol. 37, pp. 160-174,
Feb., 1988.

[42] Nair, V., and Abraham, J., “General linear codes for fault-tolerant
matrix operations on processor arrays,” Dig. of the 1988 Fault Tolerant
Computing Symposium, pp. 180-185, June, 1988.

[43] Namjoo, M., and McCluskey, E., “Watchdog processors and capability
checking,” Digest of the 1982 Fault Tolerant Computing Symposium,
pp. 245-248, IEEE Computer Society Press, 1982.

[44] Namjoo, M., “Techniques for concurrent testing of VLSI processor
operation,” Dig. 1982 Int. Test Conf., pp. 461-468, Nov., 1982.

[45] Namjoo, M., “CERBERUS-16: An architecture for a general purpose
watchdog processor,” Dig. Papers 13th Annu. Int. Symp. Fault Tolerant
Comput., pp. 216-219, June, 1983.

[46] Preparata, F. P., and Shamos, M. I., Computational geometry: an
introduction, Springer-Verlag, New York, NY, 1985.

[47] Prim, R. C., “Shortest connection networks and some generalizations,”
Bell Syst. Tech. J., pp. 1389-1401, Nov., 1957.

[48] Randell, B., “System structure for software fault tolerance,” IEEE
Trans. on Software Engineering, vol. 1, pp. 220-232, June, 1975.

[49] Schmid, M., Trapp, R., Davidoff, A., and Masson, G., “Upset exposure
by means of abstraction verification,” Dig. of the 1982 Fault Tolerant
Computing Symposium, pp. 237-244, June, 1982.

[50] Sedgewick, R., “Implementing quicksort programs,” Communications
of the ACM, pp. 847-857, 21(10), 1978.

[51] Shen, J.P., and Schuette, M.A., “On-line self-monitoring using
signatured instruction streams,” Proc. 1983 Int. Test Conf., pp. 275-282,
Oct., 1983.


[52] Siewiorek, D., and Swarz, R., The theory and practice of reliable design,
Digital Press, Bedford, MA, 1982.

[53] Sridhar, T., and Thatte, S.M., “Concurrent checking of program flow in
VLSI processors,” Dig. 1982 Int. Test Conf., pp. 191-199, Nov., 1982.

[54] Tarjan, R. E., Data Structures and Network Algorithms, Society for
Industrial and Applied Mathematics, Philadelphia, PA, 1983.

[55] Tarjan, R. E., “Efficiency of a good but not linear set union algorithm,”
J. ACM, 22(2), pp. 215-225, 1975.

[56] Tarjan, R. E., “A class of algorithms which require nonlinear time to
maintain disjoint sets,” J. of Comp. and Sys. Sci., 18(2), pp. 110-127,
1979.

[57] Tarjan, R. E., and Leeuwen, J. van, “Worst-case analysis of set union
algorithms,” J. ACM, 31(2), pp. 245-281, 1984.

[58] Tarjan, R. E., “Applications of path compression on balanced trees”,
J. ACM, pp. 690-715, Oct., 1979.

[59] Tomas, S. P., and Shen, J. P., “A roving monitoring processor for
detection of control flow errors in multiple processor systems,” Proc. IEEE
Int. Conf. Comput. Design: VLSI Comput., pp. 531-539, Oct., 1985.

[60] Taylor, D., “Error models for robust data structures,” Dig. 20th Annu.
Int. Symp. Fault Tolerant Comput., pp. 416-422, 1990, June 26-28.

[61] Williams, J. W. J., “Algorithm 232 (heapsort),” Commun. of ACM,
vol. 7, pp. 347-348, 1964.

[62] Yau, S.S., and Chen, F.-C., “An approach to concurrent control flow
checking,” IEEE Trans. Software Eng., vol. SE-6, pp. 126-137, March
1980.


APPENDIX A


DATA ACQUISITION 
MODULE 

TECHNICAL MANUAL 
Ver. 1.0 






THE TABLE OF CONTENTS 


1. The Experimental System Overview 

1.1 System Configuration 

1.2 General System Description 

1.3 System Customization 

2. Data Acquisition Module 

2.1 Hardware Overview 

2.2 Clock Control 

2.3 Address Generator 

2.4 Address Bus Buffers and Address Modifier Selector 

2.5 Data Transfer Control 

2.6 Input Channel Selector and Data Bus Buffers

2.7 VMEbus Master Control

3. Interface Signals 

3.1 VMEbus Interface 

3.2 Input Channels 

Appendix A Schematic Diagrams 
Appendix B Parts List 
Appendix C DAM Board Layout 
Appendix D Copies of Data Sheets 




1. The Experimental System Overview 

This system provides an experimental environment for recording and
analyzing upset data in computer systems. This chapter describes the system
configuration and gives a general hardware description.


1.1 System Configuration

This experimental system is mainly based on the VMEbus and controlled
by the 68030 CPU board. The VMEbus provides a master-slave,
asynchronous, non-multiplexed data transfer medium. The target system (CPU
Under Test) and the Fault Injection Module are connected by its local bus.

Fig.1.1 shows the experimental configuration. This system’s features in- 
clude: 

• 68030 CPU Board 

• Up to 8 Mbyte SRAM Memory Modules 

• Floppy Disk and SCSI Bus Controller (FDC/SCSI) 

• 80 Mbyte Hard Disk and 3.5” Floppy Disk Drive 

• OS-9 Operating System 

• Chassis with power supply, cooling fans, and motherboard 

• Data Acquisition Module 

• CPU Under Test (MC68000 Educational Computer Board) 

• Fault Injection Module 

• (GP-IB I/F Controller)

• (SUN SPARCstation) 



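[Fig. 1.1: block diagram of the experimental configuration, including a LAN
connection; only the legend below is recoverable from the original.]

DAM: Data Acquisition Module
CUT: CPU Under Test (Target System)
FIM: Fault Injection Module


Fig. 1.1 Experimental Configuration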












1.2 General System Description 

This section briefly describes each module of the experimental system.
For detailed information, refer to the user’s manuals on specific modules.

• 68030 CPU Board

- SYS68K/CPU-33XN (Force Computers Inc.)

- 68030 CPU with 16.7 MHz clock frequency.

- Not equipped with the Floating Point Coprocessor.

- 32-bit high speed DMA controller for data transfers.

- 1 Mbyte of shared dynamic RAM.

- Two multiprotocol serial I/O channels.

- Up to 2 Mbyte EPROM and up to 512 Kbyte SRAM/EEPROM.

- Real Time Clock with calendar and on-board battery backup.

- Full 32 bit VMEbus master/slave interface.

• Memory Module

- SYS68K/SRAM-6 (Force Computers Inc.)

- 2 Mbyte SRAM on SRAM-6.

- Battery backup for SRAM devices.

- 55 ns (typical) Read/Write Access Time.

- Jumper selectable access address and address modifier code.

- VMEbus interface supporting 32 data and 32 address lines.

• Floppy Disk and SCSI Bus Controller

- SYS68K/ISCSI-1 (Force Computers Inc.)

- 68010 CPU for local control.

- 68450 DMA Controller for local transfers.

- SCSI bus interface with the NCR5386S SCSI bus controller.

- SHUGART compatible floppy interface with the WD1772 FDC.

- All I/O signals available on P2 connector.

- VMEbus interface supporting A24:D16, D8.

• Mass Storage Module

- SYS68K/MSM-84 (Force Computers Inc.)

- Only VME P1 backplane is required.

- 64-pin flat cable is used to connect P2 of the ISCSI-1.

- Floppy Disk Drive (Toshiba ND352)

* Disk Size and Capacity: 3.5”, 1.0 Mbyte

* Number of Tracks: 160

* Access Time: 79 ms (average)

- Hard Disk (Quantum PRO80S)

* Disk Size and Capacity: 3.5”, 84 Mbyte

* Number of Cylinders and Heads: 834, 6

* Seek Time: 19 ms (average)

• OS-9 Operating System

- Professional OS-9 (Microware Systems Corporation)

- Multitasking, real time operating system.

- UNIX-like shell and a hierarchical directory/file structure.

- C Compiler, Assembler/Linker, and User-state Debugger.

- μMACS screen-oriented text editor.

• Chassis with power supply, cooling fans, and motherboard

- SYS68K/TARGET-32 (Force Computers Inc.)

- 19”, 7U chassis.

- 500 W power supply to drive VMEbus and mass storage memory.

- Cooling systems with four fans.

- 20 slot J1-J2 VMEbus Motherboard.



• Data Acquisition Module

- Up to 8 Mbyte address space.

- Jumper selectable address modifier code.

- 32 Input Channels with data selectors.

- VMEbus compatible data transfers supporting A24:D32, D8.

- VMEbus Master bus control (Non-slot 1)

• CPU Under Test

- MC68000 Educational Computer Board (Motorola Inc.)

- 4 MHz MC68000 16-bit CPU.

- 32 Kbyte of DRAM and 16 Kbyte firmware ROM/EPROM monitor.

- Two serial ports provided for a terminal and a host.

• Fault Injection Module

- Hardware fault injections on IC pin lines.

- Single/multiple faults of stuck/bridging types with fault duration
varying from 250 ns to .

- Application program generated fault injection.


1.3 System Customization 


This section describes the system customization required to implement 
the upset analysis experimental system. This also provides information on 
the programming of peripherals. 

• SYS68K/CPU-33XN

- OS-9/68000' EPROM Installation

* Remove VMEPROM* and install EPROMs for OS-9.

* High - Socket J6, Low - Socket J4

- EPROM Type Selection

* 27512 EPROM

* Jumperfield B1: 1 to 12, 6 to 7

- Interfacing PI/T2 User I/O Port

* Device: MC68230 Parallel Interface/Timer (PI/T)

* Accessible via the 8-bit local I/O bus. Table 1.1 shows the
register layout of PI/T2.

* User I/O port is available on P2 of VMEbus, shown in Table 1.2.

- The Address Map

* The address map of this CPU board is listed in Table 1.3.

* A24: D32, D24, D16, D8 area: SRAM-6, ISCSI-1

• SYS68K/SRAM-6

- Address Modifier Selection

* Standard Supervisor/Non-privileged Data Access

* Address Modifier Code: 3D, 39

* Jumperfield B4: 4 to 15, 2 to 17

- VMEbus Interface

* A24: D32, D16, D8

* Standard Address Mode (A24)

* Address: $XX000000 - $XX2000000 (2 Mbyte)

* Jumperfield B3: 18 to 15, 20 - 30 to 13 - 3

• SYS68K/ISCSI-1

- Address Modifier Selection

* Standard Non-privileged/Supervisory program and data access.

* Address Modifier Code: 3A, 39, 3E, 3D

* Jumperfield B22: 5 to 2, 6 to 1

- VMEbus Interface

* A24: D16, D8

* Address: $XXA00000 - $XXA1FFFF (128 Kbyte)

* Jumperfield B21: 2 to 17, 4 - 7 to 15 - 12


Table 1.1 PI/T2 Register Layout


ADDRESS    REGISTER     DESCRIPTION

FF800E00   PIT2 PGCR    Port General Control Register
FF800E01   PIT2 PSRR    Port Service Request Register
FF800E02   PIT2 PADDR   Port A Data Direction Register
FF800E06   PIT2 PACR    Port A Control Register
FF800E08   PIT2 PADR    Port A Data Register
FF800E0A   PIT2 PAAR    Port A Alternate Register
FF800E0D   PIT2 PSR     Port Status Register


Table 1.2 PI/T2 User I/O Interface Signals


PIN No.   PORT No.   IN/OUT   P2/J2 No.   SIGNAL

4         PA0        OUT      A29         READY*
5         PA1        OUT      C29         LW/B*
6         PA2        OUT      A30         SLCT0*
7         PA3        OUT      C30         SLCT1*
8         PA4                 A31         ENB0*
9         PA5                 C31         ENB1*
10        PA6                 A32
11        PA7                 C32
13        H1                  A27
14        H2                  C27
15        H3                  A28
16        H4                  C28


Table 1.3 The Address Map


START (HEX)   END (HEX)   SPACE     DESCRIPTION

00000000      003FFFFF    1.0 MB    Shared Memory
00400000      F9FFFFFF    3.9 GB    A32: D32, D24, D16, D8
FA000000      FAFFFFFF    16.0 MB   Message Broadcast Area
FB000000      FBFEFFFF    15.9 MB   A24: D32, D24, D16, D8
FBFF0000      FBFFFFFF    64.0 KB   A16: D32, D24, D16, D8
FC000000      FCFEFFFF    15.9 MB   A24: D16, D8
FCFF0000      FCFFFFFF    64.0 KB   A16: D16, D8
FD000000      FFFFFFFF              System Area


'OS-9 and OS-9/68000 are trademarks of Microware Systems Corporation. 
*VMEPROM is a PDOS based real time monitor. 










2. Data Acquisition Module 


When a fault is injected from the fault injection module, the data
acquisition module is activated and activity data on 8 or 32 observation
points are sampled synchronously with the clock of the target system and
written into the SRAM memory module.


2.1 Hardware Overview 

Basically, the data acquisition module generates the address signals from 
the clock of the target system and transfers the sampled data to the memory 
module via the VMEbus. 

A block diagram is shown in Fig.2.1. This board consists of the following 
functional blocks: 

• Clock Control (CKCTRL) 

• Address Generator (ADDGEN) 

• Address Modifier Selector (AMS) 

• Address Bus Buffers (ABUF) 

• Data Transfer Control (DTCTRL) 

• Input Channel Selectors (INSLCT) 

• Data Bus Buffers (DBUF) 

• Bus Master Control (BUSMST) 



Fig. 2.1 Functional Block Diagram



















2.2 Clock Control 


• Recording Clock Selector

- J1-1, IC1-1

- Selectable by bit 1 and 2 of J1.

* Clock of CPU Under Test: bit 1: ON, bit 2: OFF

* 16 MHz VME System Clock: bit 1: OFF, bit 2: ON

• Clock Frequency Divider

- J1-2, IC2

- Selectable by bit 3 - 7 of J1 as shown in Table 2.1.

• Qualifier Trigger

- IC1-2, IC3-1, IC10-1

- Trigger: Fault injection signal transferred from FIM.

- The trigger is enabled when ENB1 is high.

• Clear Control

- R1, IC1-3, IC16-1

- Generate Clear Signal for the Clock Control, Address Generator,
and Data Transfer Control.

- Reset Signals: System Reset, Bus Error, and End Address.

• End Address Selection

- J2-1

- End address: $XX0FFFFF - $XX7FFFFF

- Selectable by bit 1 - 4 of J2-1 as shown in Table 2.2.


Table 2.1 Frequency Division Settings

Division   bit 3   bit 4   bit 5   bit 6   bit 7

           ON      OFF     OFF     OFF     OFF
           OFF     ON      OFF     OFF     OFF
           OFF     OFF     ON      OFF     OFF
           OFF     OFF     OFF     ON      OFF
16         OFF     OFF     OFF     OFF     ON


Table 2.2 End Address Selection


End Address   bit 1   bit 2   bit 3   bit 4

$XX0FFFFF     ON      OFF     OFF     OFF
$XX1FFFFF     OFF     ON      OFF     OFF
$XX3FFFFF     OFF     OFF     ON      OFF
$XX7FFFFF     OFF     OFF     OFF     ON






2.3 Address Generator 

• Address Signal Generator

- IC4, IC5, IC6, IC7, IC8, IC9

- Implement 24-bit synchronous binary counter using a carry-look-ahead
circuit.

- Maximum clock frequency is calculated as follows:

f_max = 1 / (CLK-to-RCO t_PLH + ENT t_su)

- Address Space

* Up to 8 Mbyte Address Space. Refer to Table 2.3.

* Start address: $XX000000 (fixed)

* End address: $XX0FFFFF - $XX7FFFFF (selectable)

• Counter Status Output

- IC10-2

- When counters are enabled to count, ENB1* is asserted.


Table 2.3 Address Space and End Address


Address Space   End Address

1 Mbyte         $XX0FFFFF
2 Mbyte         $XX1FFFFF
4 Mbyte         $XX3FFFFF
8 Mbyte         $XX7FFFFF
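

The relationship between the J2-1 jumper position (Table 2.2), the recording address space, and the end address (Table 2.3) is simple power-of-two arithmetic. The sketch below illustrates it under the assumption that the buffer always starts at offset $000000 within the 24-bit space, as stated for the start address; the function names are hypothetical and not part of the module's firmware.

```python
MBYTE = 2 ** 20


def address_space_mbytes(jumper_bit):
    """Address space in Mbyte selected by J2-1 bit 1..4 (1, 2, 4, 8)."""
    if jumper_bit not in (1, 2, 3, 4):
        raise ValueError("J2-1 end address bits are numbered 1 to 4")
    return 1 << (jumper_bit - 1)


def end_address(jumper_bit):
    """Low 24 bits of the last address written before recording stops."""
    return address_space_mbytes(jumper_bit) * MBYTE - 1


def as_label(address):
    """Format in the manual's notation, with XX for the unused top byte."""
    return f"$XX{address:06X}"


for bit in (1, 2, 3, 4):
    print(bit, address_space_mbytes(bit), "Mbyte", as_label(end_address(bit)))
```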


2.4 Address Bus Buffers and Address Modifier Selector 

• Address Bus Buffers

- IC12, IC13, IC14

- Three transparent D-latches (74AS573) interface local address
signals with the VMEbus address bus.

- DHBA* places the 24-bit outputs in either a normal logic state or
a high-impedance state.

• Address Modifier Selector

- J2-2, RN, IC11

- 6-bit Codes: Used for an additional decoding parallel to the
address signals.

- Address Mode: Supports the standard address mode (A24) for
supervisor or non-privileged memory access.

* 3E: Standard Supervisor Program Access

* 3D: Standard Supervisor Data Access

* 3A: Standard Non-Privileged Program Access

* 39: Standard Non-Privileged Data Access

- Selectable by bit 5 - 10 of J2 as shown in Table 2.4.


Table 2.4 Address Modifier Codes and Settings


HEX   Binary   bit 5   bit 6   bit 7   bit 8   bit 9   bit 10

3E    111110   OFF     OFF     OFF     OFF     OFF     ON
3D    111101   OFF     OFF     OFF     OFF     ON      OFF
3A    111010   OFF     OFF     OFF     ON      OFF     ON
39    111001   OFF     OFF     OFF     ON      ON      OFF
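

The pattern in Table 2.4 suggests that each J2 position pulls one address modifier bit low when it is ON. The sketch below encodes that reading: it assumes position 5 corresponds to the most significant code bit and position 10 to the least significant, and that an OFF position leaves the bit at logic 1. This mapping is inferred from the four tabulated rows, not stated in the manual, and the function name is hypothetical.

```python
def address_modifier_code(on_positions):
    """6-bit address modifier code for the given set of ON J2 positions.

    Assumption inferred from Table 2.4: J2 positions 5..10 map to code
    bits 5..0, and an ON jumper forces the corresponding bit to 0.
    """
    code = 0
    for position in range(5, 11):
        if position not in range(5, 11):
            raise ValueError("position out of range")
        bit_index = 10 - position          # position 5 -> bit 5, 10 -> bit 0
        if position not in on_positions:   # OFF leaves the bit high
            code |= 1 << bit_index
    return code


settings = {
    "Supervisor Program": {10},
    "Supervisor Data": {9},
    "Non-Privileged Program": {8, 10},
    "Non-Privileged Data": {8, 9},
}
for name, on in settings.items():
    code = address_modifier_code(on)
    print(f"{name:24s} {code:02X} {code:06b}")
```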



2.5 Data Transfer Control 


• Data Transfer Bus Control 

- ENBl, DWB* 

* IClO-3, IC15-1 

* When READY* asserted, both ENBl and DWB* are latched 
to be active. 

* LCLR* resets the outputs. 

- LAS* 

* R2, IClO-4, IC15-2, IC17-1, -2 

* When READY* asserted, LAS* is set to be active. 

* During data transfers, LAS* is asserted by LCLK and reset bv 

LDTACK*. ^ 

- LAOl, LDSO-1*, LLWORD* 

* IC16-2, -3, -4, IC18-1, -2, IC30-1, -2, -3, IC33-1 

* When LW/B* is high (long word mode), LDSO*, LDSl*, LAOl, 
and LLWORD* are set to low during data transfers. 

* When LW/B* is low (byte mode), LLWORD* is set to high 
and other signals respond as follows: 

LDSO* = QAOO, LDSl* = -QAOO. LAOl = QAOl 

• Data Bus Buffer Control 

- IC17-3, -4, IC18-4, -5

- Long Word Mode (LW/B* is high)

* While DHBD* is active, ENBL* is asserted and ENBB* is
de-asserted.

- Byte Mode (LW/B* is low)

* While DHBD* is active, ENBB* is asserted and ENBL* is
de-asserted.


• Bus Release Control 


- IC31-1

- Supports Release On Request (ROR) operation.

* Bus request signals (BR0-3*) will assert BREL to release
BBSY* at the end of the current data transfer.


2.6 Input Channel Selector and Data Bus Buffers 

• Input Channel Selector

- IC10-5, -6, IC19, IC20, IC21, IC22

- Implement 32-to-8 data selectors using four 4-bit data selectors.

- Data selection is controlled by the two select inputs (SLCT0-1*)
as shown in Table 2.5.

• Data Bus Buffers 

- Long Word Mode

* IC23, IC24, IC25, IC26

* Four transparent D-latches (74AS573) interface 32-bit input
data with the 32-bit VME data bus (D00-31).

* When LAS* is taken low, the outputs are latched to retain
the data that was set up. Refer to Table 2.6.

* ENBL* places the 32-bit outputs in either a normal logic state
or a high-impedance state.

- Byte Mode

* IC27, IC32

* Two transparent D-latches (74AS573) interface the 8-bit local
data bus (LD0-7) with the 16-bit VME data bus (D00-15).

* When LAS* is taken low, the outputs are latched to retain
the data that was set up. Refer to Table 2.6.

* ENBB* places the 16-bit outputs in either a normal logic state
or a high-impedance state.


Table 2.5 Input Channel Selection

SLCT1*   SLCT0*   LD7   LD6   LD5   LD4   LD3   LD2   LD1   LD0
high     high     28    24    20    16    12    08    04    00
high     low      29    25    21    17    13    09    05    01
low      high     30    26    22    18    14    10    06    02
low      low      31    27    23    19    15    11    07    03
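
The selection rule in Table 2.5 follows a regular pattern: each local data line LDk carries input channel 4k plus an offset chosen by the two select inputs. The Python sketch below reproduces the table under two assumptions that are not stated explicitly in the text: the select inputs are active low, and SLCT1* is the more significant of the two. The function name is hypothetical.

```python
# Illustrative sketch of the 32-to-8 input channel selection in Table 2.5.
# Assumptions: select inputs are active low (a low level contributes a 1 to
# the offset) and SLCT1* is the more significant select input.

def selected_channels(slct1_high, slct0_high):
    """Return the input channel numbers routed to LD7..LD0 (in that order).

    Each argument is True when the corresponding select line is at a high
    level and False when it is at a low level.
    """
    offset = (int(not slct1_high) << 1) | int(not slct0_high)
    return [4 * line + offset for line in range(7, -1, -1)]


if __name__ == "__main__":
    for s1 in (True, False):
        for s0 in (True, False):
            print("high" if s1 else "low", "high" if s0 else "low",
                  selected_channels(s1, s0))
```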


Table 2.6 (a) Active Portions of Data Bus

DS1*   DS0*   A01    LWORD*   D24-31   D16-23   D08-15   D00-07
low    low    low    low      byte 0   byte 1   byte 2   byte 3
high   low    high   high                                byte 3
low    high   high   high                       byte 2
high   low    low    high                                byte 1
low    high   low    high                       byte 0


Table 2.6 (b) Data Organization in Memory

Operand   Byte Address
byte 0    $XXX XX00
byte 1    $XXX XX01
byte 2    $XXX XX10
byte 3    $XXX XX11
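
Table 2.6 can be read as a lookup from the strobe and address lines to the data bus lanes that carry each byte. The Python sketch below encodes that lookup. It assumes the four control columns are DS1*, DS0*, A01, and LWORD* and that single bytes are placed on D08-15 or D00-07 as in the usual VMEbus convention; the table and function names are hypothetical.

```python
# Illustrative sketch of Table 2.6: which data bus lanes carry which byte.
# Assumption: control columns are (DS1*, DS0*, A01, LWORD*), and single-byte
# transfers use D08-15 for even bytes and D00-07 for odd bytes.
LOW, HIGH = 0, 1

LANE_TABLE = {
    (LOW, LOW, LOW, LOW): {"D24-31": 0, "D16-23": 1, "D08-15": 2, "D00-07": 3},
    (HIGH, LOW, HIGH, HIGH): {"D00-07": 3},
    (LOW, HIGH, HIGH, HIGH): {"D08-15": 2},
    (HIGH, LOW, LOW, HIGH): {"D00-07": 1},
    (LOW, HIGH, LOW, HIGH): {"D08-15": 0},
}


def active_lanes(ds1_n, ds0_n, a01, lword_n):
    """Return a dict mapping each active bus lane to the byte number it carries."""
    key = (ds1_n, ds0_n, a01, lword_n)
    if key not in LANE_TABLE:
        raise ValueError("combination not listed in Table 2.6 (a)")
    return LANE_TABLE[key]


def byte_address_bits(byte_number):
    """Return the two low-order address bits of a byte, as in Table 2.6 (b)."""
    if byte_number not in (0, 1, 2, 3):
        raise ValueError("byte number must be 0..3")
    return format(byte_number, "02b")
```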









2.7 VMEbus Master Control 


• Master Bus Controller 

- IC28, IC29 

- The VME 1220¹ provides a two-device chip set for a non-slot 1 master bus
controller.

- Initiating a Bus Request

* Drive BR0* low after receiving DWB* and LAS* asserted.

- Arbitration

* After receiving BG0IN* from the daisy-chained VMEbus grants, the
local arbiter arbitrates between DWB* and BG0IN*.

• If DWB* wins the arbitration (i.e. DWB* occurs before
BG0IN*), BBSY* will be asserted.

• If BG0IN* wins, the local arbiter will drive BG0OUT*, which
passes the bus grant down the daisy chain to the adjacent
master in the system.

— Data Transfer 

* Local master does not access the bus until the previous mas- 
ter has relinquished control of bus, which occurs when AS*, 
DTACK* and BERR* are de-asserted. 

* Support Address Pipelining using DHBA* and DHBD*. 

• Broadcast the address of the next bus cycle while the data
transfer of the current cycle is occurring, i.e. DTACK* and
DSn* are still low.

• DHBA* is enabled as soon as AS* is disabled. 

• When DTACK* goes high, signifying the end of the current 
data cycle, DHBD* enables the data buffers for the next 
data cycle. 

* WRITE* is latched during address pipelining to hold its level. 

— Bus Release 

* Supports Release On Request (ROR) protocol via BREL. 

• Release the data transfer bus whenever another module 
requires it. 



• An external bus request will assert BREL to release BBSY*
at the end of the current data transfer. Refer to section
2.5.

• If no bus requests are pending, BREL will be kept
de-asserted and the local master maintains BBSY* low to
perform continuous VMEbus data transfer cycles.


¹PLX Technology, 625 Clyde Ave., Mountain View, CA 94043


3. Interface Signals 


3.1 VMEbus Interface 


This section provides information on the VMEbus interface. Table 3.1 and
Table 3.2 list the P1/J1 and P2/J2 pin assignments respectively. The P1
connector includes all the signals required for the 68000. The P2 connector
provides expansion of both address and data buses to 32 bits and also
provides 96 pins for user I/O lines.

The data transfer bus is very similar to the 68000's native buses except
for the following signals. Long word (LWORD*) is asserted for 32-bit data
transfers. The 6-bit address modifier (AM0 - AM5) allows the type of access
to be specified. The bus error signal (BERR*) is typically used to indicate
a memory error.

The interrupt bus has seven interrupt request lines (IRQi*), an interrupt
acknowledge (IACK*), and a daisy-chained priority signal (IACKIN*,
IACKOUT*). Each of the seven lines corresponds to an interrupt priority level.

The arbitration bus provides four levels of arbitration. For each level,
there is a bus request signal (BRi*) and a bus grant daisy chain (BGiIN*,
BGiOUT*). The utility bus consists of SYSCLK, SYSRESET*, SYSFAIL*,
ACFAIL*, and power supplies.


Table 3.1 VMEbus P1/J1 Pin Assignments

PIN No.   P1/J1 ROW A   P1/J1 ROW B   P1/J1 ROW C
1         D00           BBSY*         D08
2         D01           BCLR*         D09
3         D02           ACFAIL*       D10
4         D03           BG0IN*        D11
5         D04           BG0OUT*       D12
6         D05           BG1IN*        D13
7         D06           BG1OUT*       D14
8         D07           BG2IN*        D15
9         GND           BG2OUT*       GND
10        SYSCLK        BG3IN*        SYSFAIL*
11        GND           BG3OUT*       BERR*
12        DS1*          BR0*          SYSRESET*
13        DS0*          BR1*          LWORD*
14        WRITE*        BR2*          AM5
15        GND           BR3*          A23
16        DTACK*        AM0           A22
17        GND           AM1           A21
18        AS*           AM2           A20
19        GND           AM3           A19
20        IACK*         GND           A18
21        IACKIN*       SERCLK        A17
22        IACKOUT*      SERDAT*       A16
23        AM4           GND           A15
24        A07           IRQ7*         A14
25        A06           IRQ6*         A13
26        A05           IRQ5*         A12
27        A04           IRQ4*         A11
28        A03           IRQ3*         A10
29        A02           IRQ2*         A09
30        A01           IRQ1*         A08
31        -12VDC        +5VSTDBY      +12VDC
32        +5VDC         +5VDC         +5VDC









Table 3.2 VMEbus P2/J2 Pin Assignments

PIN No.   P2/J2 ROW A   P2/J2 ROW B   P2/J2 ROW C
1                       +5VDC
2                       GND
3                       RESERVED
4                       A24
5                       A25
6                       A26
7                       A27
8                       A28
9                       A29
10                      A30
11                      A31
12                      GND
13                      +5VDC
14                      D16
15                      D17
16                      D18
17                      D19
18                      D20
19                      D21
20                      D22
21                      D23
22                      GND
23                      D24
24                      D25
25                      D26
26                      D27
27                      D28
28                      D29
29        READY*        D30           LW/B*
30        SLCT0*        D31           SLCT1*
31        ENB0*         GND           ENB1*
32                      +5VDC










3.2 Input Channels 


The input channels consist of data channels (DATA00-31), clock (CLK),
and trigger signal (TRIG*). Table 3.3 shows the pin assignments of the input
channels.


Table 3.3 Input Channel Pin Assignments

PIN    DAM Signal   ECB Signal
(a)
(b)    GND          GND
(c)
(d)    GND          GND
1      DATA04       D04
2      DATA03       D03
3      DATA05       D05
4      DATA02       D02
5      DATA06       D06
6      CLK          4M-CLK
7      DATA07       D07
8      DATA14       D14
9      DATA08
10     DATA15       D15
11     DATA09
12     TRIG*        FIEN* (see note)
13     DATA10       D10
14     DATA01       D01
15     DATA11       D11
16                  E
17     DATA12       D12
18                  AS*
19     DATA13       D13
20                  UDS*
21     DATA00       D00
22                  LDS*
23     DATA31       A15
24     DATA16       R/W*
25     DATA30       A14
26     DATA29       A13
27     DATA28
28                  FC2
29     DATA27
30                  FC1
31     DATA26       A10
32                  FC0
33     DATA25       A09
34     DATA17       A01
35     DATA24       A08
36     DATA18       A02
37     DATA22       A06
38     DATA19       A03
39     DATA23       A07
40     DATA20       A04
41     DATA21       A05
42                  DTACK*
43                  8M-CLK
44                  6800IRQ*
45                  1M-CLK
46                  VMA*

Note: FIEN* is Fault Injection Enable, a signal transferred from the fault injection module.

















Appendix A Schematic Diagrams

A.1 Clock Control
A.2 Address Generator
A.3 Address Bus Buffers and Address Modifier Selector
A.4 Data Transfer Control
A.5 Input Channel Selector and Data Bus Buffers
A.6 VMEbus Master Control














[Schematic diagrams not reproduced. Signal labels legible on the drawings: QA00-23, ENB1, DWB*, LAS*, LA01, LDS0*, LDS1*, ENBL*, ENBB*, LLWORD*, BREL, SLCT0*, D00-31.]
















Appendix B Parts List 


Table B.1 DAM Parts List (1)

LABEL   Part Number   Pins   DESCRIPTION
IC1     74LS132       14     Quadruple Schmitt NAND gates
IC2     74LS161A      16     Synchronous 4-bit counter
IC3     74AS74        14     Dual D-type F/Fs
IC4     74LS161A      16     Synchronous 4-bit counter
IC5     74LS161A      16     Synchronous 4-bit counter
IC6     74LS161A      16     Synchronous 4-bit counter
IC7     74LS161A      16     Synchronous 4-bit counter
IC8     74LS161A      16     Synchronous 4-bit counter
IC9     74LS161A      16     Synchronous 4-bit counter
IC10    74LS04        14     Hex inverters
IC11    74AS573       20     Octal D-type transparent latches
IC12    74AS573       20     Octal D-type transparent latches
IC13    74AS573       20     Octal D-type transparent latches
IC14    74AS573       20     Octal D-type transparent latches
IC15    74AS74        14     Dual D-type F/Fs
IC16    74AS02        14     Quadruple 2-input NOR gates
IC17    74AS00        14     Quadruple 2-input NAND gates
IC18    74AS04        14     Hex inverters
IC19    74LS153       16     Dual 4-to-1 data selectors





Table B.2 DAM Parts List (2)

LABEL   Part Number   Pins   DESCRIPTION
IC20    74LS153       16     Dual 4-to-1 data selectors
IC21    74LS153       16     Dual 4-to-1 data selectors
IC22    74LS153       16     Dual 4-to-1 data selectors
IC23    74AS573       20     Octal D-type transparent latches
IC24    74AS573       20     Octal D-type transparent latches
IC25    74AS573       20     Octal D-type transparent latches
IC26    74AS573       20     Octal D-type transparent latches
IC27    74AS573       20     Octal D-type transparent latches
IC28    VME1220A      24     VMEbus master controller (Non-slot 1, P-45)
IC29    VME1220B      24     VMEbus master controller (Non-slot 1, P-45)
IC30    74AS02        14     Quadruple 2-input NOR gates
IC31    74LS20        14     Dual 4-input NAND gates
IC32    74AS573       20     Octal D-type transparent latches
IC33    74AS00        14     Quadruple 2-input NAND gates


Appendix C 


DAM Board Layout 


C.1 Component Side Layout
C.2 Wiring Side Layout






































































Appendix D 


Copies of Data Sheets 


D.1 VME 1220 Non-Slot 1 VMEbus Master Controller



VME 1210/1220 



June 1990 


Slot 1 and Non-Slot 1 VMEbus Master Controllers


Distinctive Features

• VME 1210 provides two device chip set for slot 1
master bus controller and single level arbiter

• VME 1220 provides two device chip set for non-slot 1
master bus controller

• Integrates 48mA and 64mA VMEbus signals:
AS*, DS0*, DS1*, WRITE*, BR*, BBSY*

• Integrates input hysteresis buffers

• Supports Release When Done (RWD) and Release On
Request (ROR) protocols

• Supports address pipelining, block transfers, and
early BBSY* release

• Available in Commercial, Industrial and Military
temperature ranges


Programmable Version Available


If the VME 1210/1220 does not match the requirements
of the design, a programmable version is available (the
PLX 464) which allows the user to customize all inputs,
outputs and logic. Programming is performed using
industry standard tools such as ABEL™ and CUPL™
software and commonly available PLD programming
hardware. Contact PLX for a data sheet on the PLX 464
and other PLX products.


Applications 

• VMEbus masters residing in slot 1 boards (VME 1210)

• VMEbus masters residing in non-slot 1 boards (VME 1220)

General Description 

The VME 1210: The VME 1210 is comprised of the VME
1210A and the VME 1210B for slot 1 applications. The
devices are CMOS and packaged in 24 pin 300 mil wide DIPs
or 28 pin J-type LCCs. The VME 1210A provides bus
requesting, local arbitration, and single level system
arbitration. The VME 1210B functions as the VMEbus controller.
The requester initiates a VMEbus request from the local
master's bus request for a data or interrupt cycle. The bus
controller controls the bus after initiation of a bus cycle and
relinquishes the bus at the end of the bus cycle. The bus
controller supervises the handshaking between the local
master CPU and the slave modules.

The VME 1220: The VME 1220 is comprised of the VME
1220A and the VME 1220B for non-slot 1 applications. The
devices are CMOS and packaged in 24 pin 300 mil wide DIPs
or 28 pin J-type LCCs. The VME 1220A provides bus
requesting and local arbitration. The VME 1220B functions
as the VMEbus controller. The requester initiates a VMEbus
request from the local master's bus request for a data or
interrupt cycle. The bus controller controls the bus after
initiation of a bus cycle and relinquishes the bus at the end
of the bus cycle. The bus controller supervises the
handshaking between the local master CPU and the slave
modules.



[Figure 1. Pinout of VME 1210/1220 (DIPs). Pinout drawings of the VME 1210A and VME 1210B (slot 1 master) and of the VME 1220A and VME 1220B (non-slot 1 master); not reproduced.]




Patent Pending

ABEL is a trademark of Data I/O Corp.

CUPL is a trademark of Logical Devices, Inc.

PLX Technology, Inc.



Pin Description
VME 1220A

Pin# LCC     Pin# DIP   Signal      I/O   Function
3            2          BREL        I     Active high; Bus release signal indicating BBSY* can be released.
4            3          LAS*        I     Active low; Address strobe from local master.
5            4          SYSRESET*   I     Active low; VMEbus System Reset.
6            5          DWB*              Active low; Device wants bus, local master requests control of bus.
7            6          AS*         I     Active low; VMEbus Address Strobe.
9            7          -           I     Connect to pin 17 (DIP) or pin 20 (LCC).
10           8          -           I     Connect to pin 16 (DIP) or pin 19 (LCC).
11           9          NC          -     No Connect.
12           10         NC          I     No Connect.
13           11         BGIN              Active high; Inverted VMEbus Bus Grant In signal, BGIN*.
14, 21, 24              Vss               Chip Ground.
16           13         -           O     Connect to pin 14 (DIP) or pin 17 (LCC).
17           14         -           I     Connect to pin 13 (DIP) or pin 16 (LCC).
18           15         NC          O     No Connect.
19           16         -           O     Connect to pin 8 (DIP) or pin 10 (LCC).
20           17         -           O     Connect to pin 7 (DIP) or pin 9 (LCC).
23           19         DHBA*             Active low; Device has bus address, address buffer enable.
25           21         BGOUT*            Active low; VMEbus Bus Grant Out signal.
26           22         BBSY*       I/O   Active low, 48 mA open collector; VMEbus Bus Busy signal.
27           23         BR*               Active low, 48 mA open collector; VMEbus Bus Request signal.
2, 28        1, 24      Vcc               +5 V Chip Power
1, 8, 15, 22 -          NC          -     No Connect.






































































Pin Description

VME 1210B and VME 1220B

Function

Active low; Device wants bus, local master wants control
of VMEbus.

Active low; Lower data strobe from local master.

Active low; Upper data strobe from local master.

Active low; VMEbus Data Transfer Acknowledge, data is
valid during a read cycle or data has been accepted from
the bus during a write cycle.

Active low; VMEbus Error signal.

Active low; Device has bus address, address buffer
enable.

Active low; Address strobe from local master.

Active high/low; Read or write cycle from local master.

Active low; VMEbus Busy, local master controls bus.

Connect to pin 17 (DIP) or pin 20 (LCC).

Chip Ground.

Active low; 64mA VMEbus lower Data Strobe signal,
indicates valid data on bus.

Connect to Vss.

Active low; 64mA VMEbus upper Data Strobe signal,
indicates valid data on bus.

Active low; Open collector signal, bus error to local
master.

Connect to pin 11 (DIP) or pin 13 (LCC).

Active low; 48mA VMEbus Write signal, indicates bus
read or write cycle.

Active low; Device has bus data, data buffer enable.

Active low; Open collector signal, data acknowledge to
local master.

Active low; 64mA VMEbus Address Strobe signal,
indicates valid address on bus.

+5 V Chip Power

No Connect.















































































VME 1210/1220 Timing Waveforms

[Timing waveform figure not reproduced.]




Timing Specification


Timing Parameters

Signals

DWB* to BR* asserted

LAS* to BR* asserted

BR* to BG asserted

BGIN to BBSY* asserted

BBSY* to BR* negated

BBSY* to DHBA* asserted

BBSY* to BGIN negated

DHBA* to DHBD* asserted

DHBA* to WRITE* asserted

DHBA* to AS* asserted

AS* to DSn* asserted

BGIN to BBSY* negated



Max. time (ns) unless
otherwise specified

Description

If DWB* is asserted after LAS*

If LAS* is asserted after DWB*

VME 1210 only when internal BR* generated
(B3 connected to BGIN)

VME 1210 only when external BR* received
(BG connected to BGIN)

VME 1220 only

VME 1210 only. Includes delay line; 55ns for
M-65, 45ns for M-55, 35ns for C-45, 40ns for
C-35, 60ns for C-25 part

VME 1220 only



VME 1210 only 


System arbiter time j System arbiter time | VME 1220 only 



90 

70 (min.) 



Conditional upon R/W* value 


Ensures 35ns minimum address to AS* and
data to DSn* set up times



135 max 


105 min 


195 max 


165 min 


BREL to BBSY* negated 


DTACK* to LDTACK* asserted 


LDTACK* to LAS*/LDSn* negated 


LAS* to DHBA* negated 


DWB* to DHBA* negated 


LAS* to AS* negated 


LDSn* to DSn* negated 


LDSn* to DSn* negated 


DSn* to WRITE* negated 


DSn*/DTACK* to LDTACK* 
negated 


BGIN to BGOUT* asserted 


BGIN to BGOUT* negated 


Latest of LAS*/DWB* to AS* 
asserted 


Latest of DHBD*/LDS* to DS* 
asserted 



@ Local master 


@ Local master 


65 



VME 1210 only; t7min + t12min ≥ 90 ns min.
BBSY* assertion


VME 1220 only 


VME 1220 only, (see note below) 


Valid only when BREL is asserted after 
BGIN is negated 


Local master's time to negate strobes 


If DWB* already negated 


If LAS* already negated 



Ensures 10ns hold time 


Earliest negation of DSn* or DTACK* causes 
LDTACK* to be negated. 


VME 1220 only 


VME 1210 only 


Assertion time when already have bus
(BBSY* asserted).


Assertion time when already have bus 
(BBSY* asserted) 


BBSY* is guaranteed to be asserted for a minimum of 90 ns in the VME 1210A devices and the C-45 device of the VME 1220A, even if BGIN is negated
immediately after BBSY* is asserted. For the C-35 and C-25 VME 1220A devices, the sum of the system arbiter "BBSY* asserted to BGIN* negated"
time and the t12 minimum time on the VME 1220A must be greater than 90 ns. Generally, this time will be taken up completely by the system arbiter
time; however, if not, a delay line can be connected between pins 8 and 16 (DIP) or pins 10 and 19 (LCC) on the VME 1220A device to guarantee the
90 ns minimum. For example, if the system arbiter "BBSY* asserted to BGIN* negated" time was 35 ns (min), no delay line would be needed for the
C-35 VME 1220A device, since 35 + 70 > 90. However, a 10 ns delay line would be required for the C-25 VME 1220A.







































































































































APPENDIX B


FAULT INJECTION MODULE
SCHEMATIC DIAGRAMS


Ver. 1.0













[Schematic diagrams not reproduced.]














Fault Injection Module
Parts List (1)

Ref No.   Part Number    Size     Description
IC1       SN74ALS520     20       8-bit Identity Comparator
IC2       SN74ALS520     20       8-bit Identity Comparator
IC3       SN74ALS138     16       3 to 8 Decoder
IC4-1     SN74ALS32      14       Quad 2-Input OR Gates (1/4)
IC5       VME 2000       24 (a)   Slave Module Interface Device
IC6       SN74F374       20       Octal D-Type Flip-Flops
IC7       SN74LS645-1    20       Octal Bus Transceivers
IC8       MC68230 P8     48       Parallel Interface/Timer (PIT-0)
IC9-1     SN74ALS04B     14       Hex Inverters (1/6)
IC10-1    SN74LS244      20       Octal Buffers (1/2)
IC11      SN74ALS161B    16       4-bit Binary Counter
R1                       8 (b)    R Network, seven 4.7 kΩ (1/7)
IC12      SN74ALS520     20       8-bit Identity Comparator
IC13      SN74ALS520     20       8-bit Identity Comparator
IC14-1    SN74ALS04B     14       Hex Inverters (1/6)
IC14-2    SN74ALS04B     14       Hex Inverters (2/6)
IC14-3    SN74ALS04B     14       Hex Inverters (3/6)
IC14-4    SN74ALS04B     14       Hex Inverters (4/6)
IC14-5    SN74ALS04B     14       Hex Inverters (5/6)
IC15-1    SN74ALS02      14       Quad 2-Input NOR Gates (1/4)
IC16-1    SN74ALS01      14       Quad 2-Input NAND Gates (1/4)
IC17      SN74ALS153     16       Dual 1 of 4 Data Selectors
IC18-1    SN74ALS74A     14       Dual D-Type Flip-Flops (1/2)
R2                       8        R Network, seven 4.7 kΩ (2/7)
DL1       RWT050P        14       50ns Delay Line

(a) 300mil 24 pin DIP
(b) Single-in-line package



Fault Injection Module
Parts List (2)

Ref No.   Part Number    Size   Description
IC9-2     SN74ALS04B     14     Hex Inverters (2/6)
IC9-3     SN74ALS04B     14     Hex Inverters (3/6)
IC15-2    SN74ALS02      14     Quad 2-Input NOR Gates (2/4)
IC16-2    SN74ALS01      14     Quad 2-Input NAND Gates (2/4)
IC19      SN74ALS153     16     Dual 1 of 4 Data Selectors
IC20      SN74ALS153     16     Dual 1 of 4 Data Selectors
IC21      SN74ALS153     16     Dual 1 of 4 Data Selectors
IC22      SN74ALS153     16     Dual 1 of 4 Data Selectors
R3                       8      R Network, seven 4.7 kΩ (3/7)
R4                       8      R Network, seven 4.7 kΩ (4/7)
R5                       8      R Network, seven 4.7 kΩ (5/7)
IC23      MC68230 P8     48     Parallel Interface/Timer (PIT-1)
IC24      SN74LS449      16     Bus Transceivers w/ Bit dir.
IC25      SN74LS449      16     Bus Transceivers w/ Bit dir.
IC26      SN74LS449      16     Bus Transceivers w/ Bit dir.
IC27      MC68230 P8     48     Parallel Interface/Timer (PIT-2)
IC28      SN74LS449      16     Bus Transceivers w/ Bit dir.
IC29      SN74LS449      16     Bus Transceivers w/ Bit dir.
IC30      SN74LS449      16     Bus Transceivers w/ Bit dir.
IC31      MC68230 P8     48     Parallel Interface/Timer (PIT-3)
IC32      SN74LS449      16     Bus Transceivers w/ Bit dir.
IC33      SN74LS449      16     Bus Transceivers w/ Bit dir.
IC34      SN74LS449      16     Bus Transceivers w/ Bit dir.
IC35      MC68230 P8     48     Parallel Interface/Timer (PIT-4)
IC36      SN74LS449      16     Bus Transceivers w/ Bit dir.
IC37      SN74LS449      16     Bus Transceivers w/ Bit dir.
IC38      SN74LS449      16     Bus Transceivers w/ Bit dir.







FIG. FAULT INJECTION MODULE (4-BIT)







Additional Components for the New Experimental System

Part No.   Manufacturer   Description                          Cost ($)
MZ 7500    MIZAR          GPIB Interface Board for VMEbus      695.00
           MIZAR          Single Cable for MZ 7500             75.00
MacII488   IOtech         GPIB Controller Board for Mac II     535.00
PFG5105    Tektronix      Pulse Generator (demo)               2,471.25
PFG5105    Tektronix      Pulse Generator (new)                2,800.75
TM5006     Tektronix      Prog. Mainframe (demo)               851.25
FIM        JHU            48ch Fault Injector
Mac II     Apple          Macintosh II
SPARC      Sun Micro.     SPARCstation work station























SPARC 



FAULT INJECTION EXPERIMENTAL CONFIGURATION 


















Targeted Features of the Fault Injection Module 


• Fault Injector

- Provides 48 channels with bit-definable outputs using four PI/T
(MC68230) and twelve bus transceiver (74LS446) chips.

- Supports three output states (0, 1, and Z*) on each channel.

- A 2ch pulse generator is installed as a source of fault injections.

- Supports single/multiple faults of stuck-at-0/1 types with
duration varying from 40 ns to 99.9 ms.

• Word Recognizer

- Provides a versatile trigger source for the fault injection and data
acquisition.

- Implements a 16-bit word recognizer using a MC68230 PI/T and
two 74LS686 magnitude comparators.


*Z: High-impedance
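
The three output states listed above can be modeled in software to reason about what a given channel configuration does to a bus word. The Python sketch below is a behavioral illustration only, not a description of the module's hardware: the names are hypothetical, and it assumes that a channel in state 0 or 1 forces the corresponding bit (stuck-at-0 or stuck-at-1) while a channel in state Z leaves the bit unchanged.

```python
# Behavioral sketch of stuck-at fault injection on a bus word.
# Assumptions (not taken from the module's schematics): a channel in state
# ZERO or ONE forces its bit; a channel in state Z (high impedance) does not
# disturb the bit.
from enum import Enum


class ChannelState(Enum):
    ZERO = "0"   # stuck-at-0
    ONE = "1"    # stuck-at-1
    Z = "Z"      # high impedance, no fault injected


def inject(word, states):
    """Apply per-bit channel states to a word; states[i] controls bit i."""
    for bit, state in enumerate(states):
        if state is ChannelState.ZERO:
            word &= ~(1 << bit)
        elif state is ChannelState.ONE:
            word |= 1 << bit
    return word


if __name__ == "__main__":
    states = [ChannelState.ONE, ChannelState.Z, ChannelState.ZERO, ChannelState.Z]
    print(format(inject(0b0110, states), "04b"))
```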


N94- 36066 ' 





Certification Trails and Software Design for Testability

Gregory F. Sullivan¹
Dept. of Computer Science
Johns Hopkins University
Baltimore, MD 21218

Dwight S. Wilson²
Dept. of Computer Science
Johns Hopkins University
Baltimore, MD 21218

Gerald M. Masson³
Dept. of Computer Science
Johns Hopkins University
Baltimore, MD 21218


Abstract 

This paper investigates design techniques which may
be applied to make program testing easier. We present
methods for modifying a program to generate additional
data which we refer to as a certification trail.
This additional data is designed to allow the program
output to be checked more quickly and effectively.
Certification trails [14, 16] have heretofore been described
primarily from a theoretical perspective. In this paper,
we report on a comprehensive attempt to assess
experimentally the performance and overall value of the
certification trail method. The method has been applied to
nine fundamental, well-known algorithms for the
following problems: convex hull, sorting, Huffman tree,
shortest path, closest pair, line segment intersection,
longest increasing subsequence, skyline, and Voronoi
diagram. Run-time performance data for each of these
problems is given, and selected problems are described
in more detail. Our results indicate that there are many
cases in which certification trails allow for significantly
faster overall program execution time than a 2-version
programming approach, and also give further evidence
of the breadth of applicability of this method.
Keywords: Software design for testability, software
fault detection, certification trails, error monitoring,
design diversity, data structures.


1 Introduction

We have examined a wide variety of fundamental
algorithms to determine how they can be redesigned
to allow for easier testability. To make the problem
of testing the correctness of the output of a program
more tractable we have found it is desirable to modify
the program so that it generates additional data which
we refer to as a certification trail. This additional data
is designed to allow the program output to be checked
more quickly and effectively. Our previous work on
certification trails emphasized a theoretical perspective in
which we proved that the asymptotic time complexity
of the testing process could be reduced [14, 16]. In
this paper, we report on implementations of the
certification trail method so as to assess experimentally
with run-time data the performance and overall value
of the technique. We have implemented the certification
trail method for nine fundamental and well-known
algorithms of broad importance and applicability. For
each algorithm, we have produced three implementations:
a version which produces the output; a version
which produces the output and generates a certification
trail; and a version which checks the output while
utilizing the certification trail. Specifically, algorithms
for the following problems are analyzed: Huffman tree,
shortest path, sorting, closest pair, line segment
intersection, convex hull, longest increasing subsequence,
skyline, and Voronoi diagram. The scope of the
algorithms considered gives credibility to the overall
applicability of the certification trail method. Furthermore,
comparisons of run-time data for each of the three
versions of each of the algorithms considered reveal many
cases in which an approach using certification trails
allows for significantly faster overall program execution
time than a 2-version programming approach.
¹Research partially supported by NSF Grants CCR-8910569
and CCR-8908092 and an IBM Technology Interchange Program
Grant.

²Research partially supported by NSF Grant CCR-8910569
and an IBM Technology Interchange Program Grant.

³Research partially supported by NASA Grant NSG 1442 and
an IBM Technology Interchange Program Grant.


Paper 7.3, INTERNATIONAL TEST CONFERENCE 1993, page 200
0-7803-1429-8/93 $3.00 © 1993 IEEE


2 Introduction to Certification Trails

First, let us consider a basic method which is used
to perform testing to detect software faults called
N-version programming [1, 2]. This method utilizes N
teams of programmers, each independently implementing
separate programs based on a problem specification.
The programs are executed on the same input and
the outputs are compared. Errors caused by software
faults are detected whenever the independently written
programs do not generate coincident errors. Thus
the technique exploits design diversity. Also, note that
the method can detect hardware faults which affect the
separate executions in distinct ways causing distinct
outputs. It is particularly valuable for detecting errors
caused by transient fault phenomena. The N-version
programming method can be used to detect faults after
a system has been put into production or it can be
used to detect faults in a testing phase prior to
production. If two teams are used then we refer to the method
as 2-version programming.

[Figure 1: Timeline Comparison of the Certification Trail Method with 2-Version Programming. Each timeline shows a primary execution, a secondary execution, and a compare step.]

The certification-trail technique is designed to provide
similar capabilities for detecting software and
hardware faults as 2-version programming but expend
[...]. The central idea is to modify the first program
so that, with modest additional overhead, it leaves
behind a trail of data which we call a certification
trail. This data is chosen so that it can allow the
second algorithm to execute more quickly and/or have a
simpler structure than the first algorithm. As above,
the outputs of the two executions are compared and are
accepted as correct only if they agree. An illustration
of typical execution times for 2-version programming
and the certification trail method is given in Figure 1.
We assume that the two implementations developed for
2-version programming [...].

Care must be taken in defining this method, or else its
error detection capability might be reduced by the
introduction of data dependency between the two program
executions. For example, suppose the first program
execution contains an error which causes an incorrect
output and an incorrect trail of data to be generated.
Further suppose that no error occurs during the
execution of the second program. It still appears possible
that the execution of the second program might use the
incorrect trail to generate an incorrect output which
matches the incorrect output of the first execution. The
second execution would be "fooled" by the data left
behind by the first execution. The definitions given
below exclude this possibility: they demand that the
second execution either generates a correct answer or
signals that an error has been detected.


^ Definition of a Certification 

e.rW^r definition of a 

certification trail and discuss some aspects of its real- 
izations and uses. 

Definition 3.1 A problem P is formalized as a rela- 

pairs- Let D be the domain 
( IS, the set of inputs) of the relation P and let S 
be the range (that is, the set of possible solutions). We 
say an algorithm A solves a problem P iff for all j e D 
when d .input to A then an s e S is output such that 

tfof to ^ l P • D ^ S be a problem. A solu- 

tion to this problem using a ceritficaUon fra, /consists of 

rang^^F n domains and 

rang« F, . D S x T and Fj : D x T S U {error) 

s^trsft thrill The functions must 

satisfy the following two properties: 

F * e S and / G T such that 

fO't f ‘ = « and (d,s) e P 

(2) for all d 6 D and all < e T either 

(F 2 (d, <) = s and (d, S) e P) or F^(d, t) = error. 


implemented so 

that they map elements which are not in their respec- 
tive domains to the error symbol. Intuitively, the first 
condition states that if both parts of our solution exe- 
ute correctly, then their answers agree and are correct 
The ^cond condition states that a correct secondary 
execution will never produce an incorrect output, i e 
one that is not a solution to the problem. 

The error detection capability of the certification trail approach is
similar to that obtained with the 2-version programming approach
discussed earlier. That is, if a software or hardware fault occurs
during one of the executions, then either the fault will be detected
or the output will be a correct solution to the problem. The examples
in this paper will indicate that this new approach can save overall
execution time.


4 Certification Trail Examples


In this section we evaluate the use of certification trails for nine
classic problems in computer science. We have implemented algorithms
for these problems together with algorithms which generate and use
certification trails. In addition we discuss a general technique for
construction of certification trails for algorithms using a wide
range of data structures. This technique is used to implement the
certification trails for several of our examples.

We provide a full description of the algorithm for the 
convex hull problem which generates a certification trail 
and a full description of the algorithm which uses that 
trail. Because of space considerations the discussion 
of the other algorithms is abbreviated. In some cases 
references to previous publications or technical reports 
which describe the algorithms more fully are given. 

The algorithms we have chosen to implement are 
not always the algorithms which have the smallest 
asymptotic time complexity. Often the asymptoti- 
cally fastest algorithms have large constants of pro- 
portionality which make them slower on the data sizes 
we examined. We modified and used some programs 
from major software distributions such as quicker-sort 
from a Berkeley Unix distribution. Fortune’s algo- 
rithm for computing the Voronoi diagram was obtained 
from an Internet site at AT&T Bell Labs. Other algorithms
were based on textbook discussions. It should
be stressed here that this research is continuing as 
we further increase our corpus of algorithm and data- 
structure implementations. 

4.1 Explanation of timing data

We have collected timing data for the algorithms on 
a Sun SPARCstation ELC with 16MB of RAM. The
system was run as a standalone machine in single user 
mode during the timing experiments. Timing data was 
obtained through the getrusage() system call. The user 
times are reported in the data. 

Much of the data presented in the timing table is 
essentially self-explanatory relative to the certification 
trail technique and algorithms considered. However, a 
brief discussion of the table entries is appropriate. 

The column labelled Basic contains timing data 
which gives the execution time of the algorithm in pro- 
ducing the output without the generation of the certi- 
fication trail. All timing data is listed in seconds. 

The Primary Execution (Prim. Exec.) column gives
the execution time of the algorithm in producing the 
output with the additional overhead of generating the 
certification trail. 

The Secondary Execution (Sec. Exec.) column gives 
the execution time of the algorithm in producing the 
output while using the certification trail. 

The Percent Savings (% Sav.) column records the percentage of
execution time saved by using the certification trail method as
compared to the 2-version programming approach. This assumes that
both versions take approximately the same amount of time to execute.

The Speedup column is the ratio of the run times of 
the Basic Algorithm and the Secondary Execution. 
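The two derived columns can be written as short formulas. The Speedup formula is the ratio stated above. The Percent Savings formula is an inferred reading rather than one given in the text: it takes the 2-version cost to be twice the Basic time and the certification trail cost to be the sum of the Primary and Secondary times, and it reproduces the first row of Table 1 (0.64, 0.67, 0.08 giving 41.41 and 8.00). The function names are hypothetical.

```python
def percent_savings(basic, primary, secondary):
    """Savings of (primary + secondary) relative to two runs of the basic algorithm.

    Assumption: 2-version programming costs about 2 * basic seconds.
    """
    two_version = 2.0 * basic
    trail_method = primary + secondary
    return 100.0 * (two_version - trail_method) / two_version


def speedup(basic, secondary):
    """Ratio of the Basic run time to the Secondary Execution run time."""
    return basic / secondary


# First row of Table 1 (convex hull, 5000 points).
print(round(percent_savings(0.64, 0.67, 0.08), 2), round(speedup(0.64, 0.08), 2))
```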

For the Huffman tree data, the input size is the number of nodes, n.
Each node is given a frequency chosen uniformly from the integers
{1, 2, ..., n}.

For the shortest path table, there are two numbers 
associated with the input size, the first is the number of 
vertices in the graph, the second the number of edges. 
A graph with the required edges is selected uniformly 
from the set of all such graphs, then tested for connect- 
edness in order to assure that paths exist to all vertices. 

For the geometric algorithms, the input size is the 
number of points (or lines) in the original data set. 
Point set input was generated by choosing points with 
integer coordinates uniformly over a large square (typ- 
ically 1,000,000 by 1,000,000 or larger square). For the 
Line Segment Intersection problem, lines were gener- 
ated by picking a line segment start point uniformly 
from a large square and picking offsets for the x- and
y-coordinates from a smaller range to give the end point
of the line segment. This was done to bound the line 
length and avoid data sets resulting in a quadratic num- 
ber of intersections. 
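The segment generation just described can be sketched as follows. This is an illustration only: the function name, the default offset range of 1,000 and the use of a seeded `random.Random` are assumptions, since the paper does not state the size of the smaller range or the generator used.

```python
import random


def random_segments(count, square=1_000_000, max_offset=1_000, seed=0):
    """Generate `count` short segments with integer endpoints.

    The start point is uniform over a square of side `square`; the end point
    is the start point plus x and y offsets drawn from [-max_offset, max_offset],
    which bounds the segment length.
    """
    rng = random.Random(seed)
    segments = []
    for _ in range(count):
        x0, y0 = rng.randint(0, square), rng.randint(0, square)
        dx = rng.randint(-max_offset, max_offset)
        dy = rng.randint(-max_offset, max_offset)
        segments.append(((x0, y0), (x0 + dx, y0 + dy)))
    return segments
```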

Data for the longest increasing subsequence problem 
was produced by generating a random permutation of 
[1..N] for input size N.

Sorting was performed on an array of pointers to 
structures. It was assumed that each structure con- 
tains an extra integer field for use in generating the 
certification trail. Sorting was performed on integer 
keys, though the technique can be used with a more 
complex key (in fact, using complex keys is very likely 
to increase the speedup achieved). Integers were chosen 
uniformly from the interval [1..1,000,000,000].

4.2 Convex Hull Example 

The convex hull problem is fundamental in the field 
of computational geometry. Our certification trail so- 
lution is based on a convex hull algorithm due to Gra- 
ham [6] called Graham’s Scan. For basic definitions in 
computational geometry see the text of Preparata and 
Shamos [11]. For simplicity in the discussion which follows
we will assume the points are in general position,
e.g., no three points are collinear. It is not hard to 
remove this restriction. 

Definition 4.1 The convex hull of a set of points, T, in the
Euclidean plane is defined as the smallest convex polygon enclosing
all the points. This polygon is unique and its vertices are a subset
of the points in T. It is specified by a counterclockwise sequence of
its vertices.

The algorithm given below constructs the convex hull incrementally in
a counterclockwise fashion. The first step of the algorithm selects
an “extreme” point and calls it $p_1$. The next two steps sort the
remaining points. The order of the points is determined by the slopes
of the line segments formed by joining each point to $p_1$. It is not
hard to show that after these three steps the points, when taken in
order $p_1, p_2, \ldots, p_n$, form a simple polygon, although this
polygon may not be convex. The Graham Scan algorithm traverses this
polygon, removing points until the resulting polygon is convex. The
main FOR loop iteration adds vertices to the polygon under
construction and the inner WHILE loop removes vertices from the
construction. A point is removed when the angle test performed at
line 6 reveals that the angle at that vertex is greater than 180
degrees. It is easy to demonstrate that when a point is removed, it
must fall within the triangle defined by three other points: $p_1$
and the two points that were adjacent to the point removed. When the
main FOR loop is complete the convex hull has been constructed. The
execution of this algorithm is demonstrated in Figure 2. For each
removed point, the associated triangle is indicated in bold lines,
and in the text below the diagram. Our certification trail relies on
the fact that these triangles can be determined quickly.

Figure 2: Convex hull example.

Point not on convex hull    Three surrounding points
p_3                         p_1, p_2, p_4
p_5                         p_1, p_4, p_6
p_7                         p_1, p_6, p_8

Algorithm CONVEXHULL(T)
Input:  Set of points, T, in R^2
Output: Counterclockwise sequence of points in R^2
        which define the convex hull of T
1   Let p_1 be the point with the largest
      x coordinate (and smallest y to break ties)
2   For each point p (except p_1) calculate
      the slope of the line through p_1 and p
3   Sort the points (except p_1) from smallest
      slope to largest. Call them p_2, ..., p_n
4   q_1 := p_1; q_2 := p_2; q_3 := p_3; m := 3
5   FOR k = 4 to n DO
6      WHILE the angle formed by
         q_{m-1}, q_m, p_k is > 180 degrees
         DO m := m - 1 END
7      m := m + 1
8      q_m := p_k
9   END FOR
10  FOR i = 1 to m DO OUTPUT(q_i)
    END FOR
End CONVEXHULL

First execution: In this execution the code CONVEXHULL is used. The
certification trail is generated by adding an output statement within
the WHILE loop. Specifically, if an angle of greater than 180 degrees
is found in the WHILE loop test then the four tuple consisting of
$q_m, q_{m-1}, p_1, p_k$ is output to the certification trail. The
final convex hull points $q_1, \ldots, q_m$ are also output to the
certification trail. Strictly speaking the trail output does not
consist of the actual points in $R^2$. Instead, it consists of
indices to the original input data. This means if the original data
consists of $p_1, p_2, \ldots, p_n$ then rather than output the
element in $R^2$ corresponding to $p_i$ the number i is output.
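A minimal sketch of the first execution follows, assuming integer coordinates and points in general position as in the text. It is an illustration rather than the authors' implementation: the function names, the 0-based indices, and the use of `fractions.Fraction` for exact slopes are choices made here. A clockwise turn at the top of the stack is used as the test for an angle greater than 180 degrees.

```python
from fractions import Fraction


def cross(o, a, b):
    """Twice the signed area of triangle o, a, b (positive if counterclockwise)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull_with_trail(points):
    """Graham scan that also records the certification trail.

    Returns (hull, removed): hull is the counterclockwise list of input
    indices; removed holds one 4-tuple of indices (x, a, b, c) per discarded
    point x, where a, b, c are the corners of a triangle containing x.
    """
    n = len(points)
    # Step 1: largest x coordinate, smallest y to break ties.
    first = max(range(n), key=lambda i: (points[i][0], -points[i][1]))
    px, py = points[first]

    def slope_key(i):
        dx, dy = points[i][0] - px, points[i][1] - py
        # A point directly above p_1 acts as slope "minus infinity": it sorts first.
        return (0, Fraction(0)) if dx == 0 else (1, Fraction(dy, dx))

    # Steps 2-3: sort the remaining points from smallest slope to largest.
    order = [first] + sorted((i for i in range(n) if i != first), key=slope_key)
    hull = order[:3]
    removed = []
    for k in order[3:]:
        # Step 6: a clockwise turn means the angle at hull[-1] exceeds 180 degrees.
        while len(hull) > 2 and cross(points[hull[-2]], points[hull[-1]], points[k]) < 0:
            x = hull.pop()
            removed.append((x, hull[-1], first, k))   # (q_m, q_{m-1}, p_1, p_k)
        hull.append(k)
    return hull, removed
```

For the six points `[(0, 0), (10, 1), (8, 9), (1, 7), (5, 4), (3, 3)]` the sketch returns the hull `[1, 2, 3, 0]` and the trail tuples `[(4, 3, 1, 5), (5, 3, 1, 0)]`.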

Second execution: Let the certification trail consist of a set of
four tuples, $(x_1, a_1, b_1, c_1), (x_2, a_2, b_2, c_2), \ldots,
(x_r, a_r, b_r, c_r)$, followed by the supposed convex hull,
$q_1, q_2, \ldots, q_m$. The code for CONVEXHULL is not used in this
execution. Indeed, the algorithm is dramatically different from
CONVEXHULL.

It consists of five checks on the trail data.

• First, it checks that there is a one to one correspondence between
the input points and the points in
$\{x_1, \ldots, x_r\} \cup \{q_1, \ldots, q_m\}$.

• Second, it checks that for each $i \in \{1, \ldots, r\}$, $a_i$,
$b_i$, and $c_i$ are among the input points.

• Third, the algorithm checks that for each
$i \in \{1, \ldots, r\}$, $x_i$ lies within the triangle defined by
$a_i$, $b_i$, and $c_i$.

• Fourth, the algorithm checks that for each triple of
counterclockwise consecutive points on the supposed convex hull, the
angle formed by the points is less than or equal to 180 degrees.

• Fifth, it checks that there is a unique point among the points on
the supposed convex hull which is a local maximum. We say a point q
on the hull is a local maximum if its predecessor in the
counterclockwise ordering has a strictly smaller y coordinate and its
successor in the ordering has a smaller or equal y coordinate.

If any of these checks fail then execution halts and 
“error” is output. Otherwise the convex hull read from 
the trail is output. As mentioned above, the trail data 
actually consists of indices into the input data. This 
does not unduly complicate the checks above; instead 
it makes them easier. The correctness and adequacy of 
these checks must be proven. A complete formal proof
is beyond the scope of this paper; instead, a brief outline
of the proof will be given.
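The five checks can be sketched as a single linear-time pass over the trail. The code below is an illustration under stated assumptions, not the authors' program: the trail is taken to be already parsed into a list of 4-tuples `removed` and a list `hull` of 0-based indices, the names `check_hull_trail` and `ERROR` are hypothetical, and a non-negative cross product is used as the test for an angle of at most 180 degrees.

```python
ERROR = "error"  # hypothetical sentinel for the error output


def cross(o, a, b):
    """Twice the signed area of triangle o, a, b (positive if counterclockwise)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def in_triangle(p, a, b, c):
    """True if p lies strictly inside triangle a, b, c (either orientation)."""
    d1, d2, d3 = cross(a, b, p), cross(b, c, p), cross(c, a, p)
    return (d1 > 0 and d2 > 0 and d3 > 0) or (d1 < 0 and d2 < 0 and d3 < 0)


def check_hull_trail(points, removed, hull):
    """Second execution: return the hull from the trail, or ERROR if a check fails."""
    n, m = len(points), len(hull)

    def valid(i):
        return isinstance(i, int) and 0 <= i < n

    # Check 1: removed points and hull points together cover each input once.
    seen = [False] * n
    for i in [t[0] for t in removed] + list(hull):
        if not valid(i) or seen[i]:
            return ERROR
        seen[i] = True
    if not all(seen) or m < 3:
        return ERROR
    for x, a, b, c in removed:
        # Check 2: the triangle corners are input points.
        if not (valid(a) and valid(b) and valid(c)):
            return ERROR
        # Check 3: the removed point lies inside its triangle.
        if not in_triangle(points[x], points[a], points[b], points[c]):
            return ERROR
    maxima = 0
    for j in range(m):
        prev_pt = points[hull[j - 1]]
        here = points[hull[j]]
        next_pt = points[hull[(j + 1) % m]]
        # Check 4: no clockwise turn at any hull vertex.
        if cross(prev_pt, here, next_pt) < 0:
            return ERROR
        # Check 5: count local maxima in the counterclockwise ordering.
        if prev_pt[1] < here[1] and next_pt[1] <= here[1]:
            maxima += 1
    return list(hull) if maxima == 1 else ERROR
```

With the points `[(0, 0), (10, 1), (8, 9), (1, 7), (5, 4), (3, 3)]`, the trail `removed = [(4, 3, 1, 5), (5, 3, 1, 0)]` and `hull = [1, 2, 3, 0]` is accepted, while a clockwise hull, a hull containing an interior point, or a trail that omits a point is rejected.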

Using our formal definition of certification trails, let 
D be the set of all finite planar point sets T. Let S 
be the set of convex polygons, with vertices in coun- 
terclockwise order (the restriction to counterclockwise 
ordering makes the convex hull unique). Then the 
problem we are considering is $HULL : D \to S$ where
HULL(T) is the polygon in S that forms the convex
hull of T.

The description of the algorithms above defines functions
$F_1$ and $F_2$. We must show that both conditions of
Definition 3.2 hold. The following two lemmas, which 
we state without proof, are required. 

Lemma 4.2 Let P be a polygon on n points $p_1, p_2, \ldots, p_n$.
P is a convex polygon iff P is simple and each angle $p_i p_j p_k$ is
less than or equal to 180 degrees, where i is in $1, 2, \ldots, n$,
$j = (i + 1) \bmod n$, and $k = (i + 2) \bmod n$.

Lemma 4.3 If P is a non-simple polygon, then either
P has more than one local maximum, or the interior angle
at some vertex is greater than 180 degrees.

These are deceptively simple statements. Though 
they are intuitively obvious, a formal proof is difficult. 
It is interesting to note that some computer graphics 
texts give an incorrect test for determining convexity of
a polygon by omitting the check for simplicity required 
by Lemma 4.2. 

Recall that the first condition is: 

For all $d \in D$ there exist $s \in S$ and $t \in T$ such that
$F_1(d) = (s, t)$ and $F_2(d, t) = s$ and $(d, s) \in P$.


Intuitively, this means that if both executions per- 
form correctly then they will both output the convex 
hull of the input, which is unique. Note that genera- 
tion of the certification trail does not affect the output 
of the Graham Scan algorithm. Thus the condition 
on $F_1(d)$ is satisfied by the correctness of the Graham
Scan algorithm, the proof of which is well known [11].
To show that $F_2(d, t) = s$, note that a copy of s is contained
on the trail t. Our description of $F_2(d, t)$ states
that s is output unless one of the five checks above
fails. It is trivial to verify that the first three of these
checks must be satisfied. The fourth check cannot fail,
since the polygon described by s is convex (because
$(d, s) \in P$). Similarly, if the fifth check fails, then the
polygon described by s has two local maxima, and this 
is not possible for a convex polygon. 

The second condition is: 

For all $d \in D$ and all $t \in T$ either ($F_2(d, t) = s$ and
$(d, s) \in P$) or $F_2(d, t) = \text{error}$.

Intuitively, this means that given an input and arbitrary
trail, $F_2(d, t)$ produces a solution to the problem
or flags an error.

Our definition of $F_2(d, t)$ states that the polygon Q stored on the
trail is output unless one of the five checks fails. We must
therefore demonstrate that if all five checks succeed, then Q is the
convex hull of the input points d. Let H be the convex hull of the
points d. The first check guarantees that every point in d is
classified as a hull point or an interior point. The second check
guarantees that the triangles used to identify interior points are
formed from input points, and the third check verifies that the
interior points are indeed inside their respective triangles. Note
that we do not attempt to verify that the triangles used are the ones
that would be produced by $F_1(d)$. In general, for a given interior
point, there may be several triangles of input points in which it is
contained. Together, the first three checks imply that all points in
H are also in Q, since it is impossible for a hull point to be
contained in a triangle. Note that these three checks do not exclude
the possibility that interior points are present in Q, nor do they
guarantee that the ordering of the hull points in Q is correct. The
final two checks accomplish this. If the last two checks are
satisfied, Lemma 4.3 implies that Q is simple, and therefore it must
be convex by Lemma 4.2.

Thus, Q is a convex polygon whose vertex set is a superset of the
vertices of H, i.e., H is contained in Q. This implies that no other
point from the input set may be a vertex of Q, since any input point
that is not a hull point is interior to H and therefore interior to
Q. Finally, it is clear that the ordering of the vertices of Q and H
must be the same (although there might appear to be two possible
orderings, clockwise and counterclockwise, a clockwise ordering will
fail the fourth check). Therefore if all five checks succeed, then
the output of $F_2(d, t)$ will be the convex hull of d.

This demonstrates that the algorithms described meet the conditions
of Definition 3.2, and are therefore a certification trail solution
to the convex hull problem.

Time complexity: In the first execution the sorting of the input
points takes O(n log(n)) time where n is the number of input points.
One can show that this cost dominates and the overall complexity is
O(n log(n)).

It is possible to implement the second execution so that all five
checks are done in O(n) time. The first two checks may be done in
linear time since the certification trail contains indices into the
input data. The third and fourth checks require a constant time
calculation at each point. Finally, the uniqueness of the local
maximum is clearly checkable in linear time.

Order-of-Magnitude Testing Speedup: It should be noted that for the
convex hull problem, we are seeing an order of magnitude speedup for
reasonably sized problems. We believe this offers a dramatic
demonstration of the efficiency of our proposed software testing
technique using certification trails in comparison with the 2-version
programming technique.


Size      Basic    Prim. Exec.          Sec. Exec.   % Sav.   Speedup
                   (Also Gen. Trail)
5000       0.64     0.67                0.08         41.41     8.00
10000      1.38     1.40                0.17         43.12     8.12
25000      3.80     3.84                0.46         44.73     8.46
50000      8.44     8.50                0.85         44.61     9.93
100000    17.36    17.68                1.65         44.33    10.52


Table 1: Convex Hull 


4.3 Sorting Example

This important problem has a massive literature. In this section we
will discuss how to apply the certification trail approach to the
sorting problem. Let us assume that the sorting algorithm takes as
input an array of n elements and outputs an array of n elements. The
algorithm is supposed to place the data in non-decreasing order.

To design a certification trail algorithm we must discover the nature
of the data that should be included in the certification trail to
allow quick computation of the final output sorted array. Suppose
that we decide to use the output array itself as the certification
trail. We note that it is easy to check that this array is in
non-decreasing order by simply performing a single pass over the
array. Unfortunately, it is considerably more difficult to make sure
that this array contains exactly the same elements as the original
input array. Indeed this problem has a lower bound time complexity of
$\Omega(n \log(n))$ in a comparison based model.

Because of this difficulty we use the permutation of the elements
defined by the input and output data arrays as the certification
trail. This permutation is computed by attaching an Item Number field
to the data elements before sorting. The i-th item receives item
number i. After the elements are sorted, the permutation from input
to output is obtained by reading the Item Numbers from the elements
in their new order.

The second execution reads the permutation from the trail and
verifies that it is a permutation on n elements, i.e., that no
numbers are repeated or omitted. This permutation is used to
rearrange the input elements in linear time. Finally the algorithm
checks that these elements are now in non-decreasing order.
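The permutation trail for sorting can be sketched in a few lines. This is an illustration under assumptions, not the authors' code: item numbers are 0-based, Python's built-in `sorted` stands in for the quicker-sort used in the experiments, and the names `sort_with_trail`, `sort_using_trail` and `ERROR` are hypothetical.

```python
ERROR = "error"  # hypothetical sentinel for the error output


def sort_with_trail(data):
    """First execution: sort, and emit the item numbers in their sorted order."""
    tagged = sorted(enumerate(data), key=lambda pair: pair[1])
    output = [value for _, value in tagged]
    trail = [item_number for item_number, _ in tagged]
    return output, trail


def sort_using_trail(data, trail):
    """Second execution: linear-time check and use of the permutation."""
    n = len(data)
    if len(trail) != n:
        return ERROR
    # Verify that the trail is a permutation: nothing repeated or omitted.
    seen = [False] * n
    for i in trail:
        if not isinstance(i, int) or not 0 <= i < n or seen[i]:
            return ERROR
        seen[i] = True
    # Rearrange the input according to the permutation.
    output = [data[i] for i in trail]
    # Verify that the rearranged elements are in non-decreasing order.
    for left, right in zip(output, output[1:]):
        if left > right:
            return ERROR
    return output
```

For `data = [5, 2, 9, 2]` the first execution returns `([2, 2, 5, 9], [1, 3, 0, 2])`, and the second execution returns the same sorted list from that trail. A trail with a repeated item number, or one that leaves the data out of order, yields `ERROR`.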


Size       Basic    Prim. Exec.          Sec. Exec.   % Sav.   Speedup
                    (Also Gen. Trail)
10000       0.28     0.30                0.04         39.29     7.00
50000       1.80     1.90                0.19         41.94     9.47
100000      3.96     [?]                 0.41         43.31     9.66
500000     23.95    24.69                2.14         43.99    11.19
1000000    50.23    51.57                4.38         44.31    11.47


Table 2: Sort 




4.4 Certification Trails for Abstract Data Types

Before we present the rest of our example algorithms
we discuss a general technique applicable to many algorithms
and data structures.

An abstract data type is a data object or set of data 
objects together with a group of operations for manip- 
ulating the object(s). Each operation takes a (possibly 
empty) set of arguments, and some, but not necessarily 
all operations return answers. Many algorithms make 
extensive use of abstract data types. 

We describe a method for automatically generating a certification
trail for an algorithm which uses an abstract data type. This is done
by modifying the abstract data type operations, so that during the
first execution they generate a certification trail, and during the
second execution they use the certification trail. Otherwise, these
operations are identical to the original abstract data type
operations, i.e., they take the same type of arguments and have the
same return types. The object of creating and using the certification
trail is to allow a more efficient implementation of the abstract
data type during the second execution.

We illustrate this technique for the following abstract data type
which we call Ordered Collection. An Ordered Collection will contain
a set of pairs (i, x) where i is an item number, and x is a real
number value. (This selection is made for simplicity of description;
the elements being stored could be more complex.) No two elements of
the set may have the same item number, though several items may have
a common value. We define a total ordering on pairs by
(i, x) < (i', x') iff

x < x' or (x = x' and i < i')

The following operations are defined on an Ordered 
Collection: 

INSERT(i,x) Add the element (i, x) to the set.

DELETE(i) Delete the element with item number i
from the set.

PREDECESSOR(i) Let (i, x) be the element in the
set with item number i. This operation returns
its predecessor, that is, the largest pair less than
(i, x). A special value SMALLEST is returned if
(i, x) is the smallest element in the set.

MIN Return the smallest element in set. 

NEAREST(x) Return the element from the set with 
value closest to x. If there is a tie, return the 
element with the smallest item number. 

This small set of operations has been chosen for concreteness;
several additional operations could easily be defined. If an error
occurs during any of these operations, for example, inserting pairs
with duplicate item numbers or attempting to delete a non-existent
item, then the program terminates indicating an error.

These operations may be modified to produce a cer- 
tification trail during the first execution by modifying 
the INSERT(i,x) and NEAREST(x) operations to do 
the following (in addition to their normal function): 

INSERT(i,x) After adding this element to the set, 
perform a PREDECESSOR(i) operation and write 
the item number of the answer to the certification 
trail. 

NEAREST(x) Write the item number of the answer 
to the certification trail. 

A typical implementation of an abstract data 
type supporting the above operations would require 
$\Omega(n \log(n))$ time to process a sequence of n operations.
By using the certification trail, we can achieve linear 
time for n operations during the second execution. This 


includes the time necessary to check the trail for cor- 
rectness as well as use it. 

The implementation of the Ordered Collection for 
the second execution will be a structure called an in- 
dexed linked list. This is a doubly linked list, along 
with an array Items of pointers, indexed by item number.
The i-th element in this array points to the list
node for the element with item number i (or is NULL if
no element in the list has item number i). This allows 
us to find an element in constant time given its item 
number. The elements themselves are maintained in 
ascending order (according to the pair ordering given 
above) on a doubly linked list, i.e., each element has 
pointers to its successor and predecessor. In addition 
to the array, we maintain a variable Start which stores
the item number of the first element in the list. 

The abstract data type operations for the second 
execution are defined as follows: 

INSERT(i,x) Read the item number p from the trail. p is the item
number that would be the predecessor of (i, x) if it were in the set.
Items[p] points to the list node for the element with index p, call
this element (p, x_p). We can insert (i, x) after this node using
ordinary list operations. Before doing so, however, we make three
checks:

i. Check that Items[i] is currently NULL, i.e., there is not
currently an element with item number i in the set.

ii. Check that (i, x) is greater than (p, x_p).

iii. Check that (i, x) is less than the successor of (p, x_p).

If these checks are satisfied, then (i, x) may be inserted after
(p, x_p). Set Items[i] pointing to the list node for (i, x).

Note that special cases occur at the beginning and end of the list.
We omit the specifics of these cases, mentioning only that Start must
be updated for insertions at the front of the list.

DELETE(i) Check that Items[i] is not NULL, i.e., there is an element
with item number i currently in the set. If so, remove it from the
linked list, and set Items[i] to NULL. If we remove the first element
of the list we must also update Start.

PREDECESSOR(i) Items[i] points to the element with item number i, and
its predecessor may be found by following the appropriate pointer.

MIN The variable Start indicates the item number of the first element
on the list, i.e., the minimum element. Items[Start] therefore points
to this element.




NEAREST(x) Read the item number i from the trail. Items[i] points to
the element having this item number, call it (i, v). To verify that
this is the correct answer we will have to check one of its
neighbors. If v < x, then only the successor of (i, v) could have a
value closer to x. Otherwise, only the predecessor is a candidate.
Check the appropriate neighbor.
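The two versions of the Ordered Collection can be sketched side by side. This is a reduced illustration, not the authors' implementation: it covers INSERT, DELETE, PREDECESSOR and MIN only (NEAREST is left out for brevity), uses 0-based item numbers, stores neighbour item numbers rather than pointers in each list node, and the names `PrimaryCollection`, `SecondaryCollection`, `TrailError` and the value chosen for `SMALLEST` are assumptions.

```python
import bisect

SMALLEST = -1  # hypothetical marker meaning "no predecessor"


class TrailError(Exception):
    """Stands in for the 'error' output when a check fails."""


class PrimaryCollection:
    """First execution: sorted list of (value, item) pairs; INSERT logs the predecessor."""

    def __init__(self):
        self.pairs, self.trail = [], []

    def insert(self, i, x):
        pos = bisect.bisect_left(self.pairs, (x, i))
        self.trail.append(self.pairs[pos - 1][1] if pos > 0 else SMALLEST)
        self.pairs.insert(pos, (x, i))

    def delete(self, i):
        self.pairs = [pair for pair in self.pairs if pair[1] != i]

    def minimum(self):
        return self.pairs[0][1]


class SecondaryCollection:
    """Second execution: indexed doubly linked list that checks and uses the trail."""

    def __init__(self, capacity, trail):
        self.items = [None] * capacity  # Items array: item number -> list node
        self.start = None               # Start: item number of the first node
        self.trail = iter(trail)

    def _node(self, i):
        ok = isinstance(i, int) and 0 <= i < len(self.items)
        return self.items[i] if ok else None

    def insert(self, i, x):
        p = next(self.trail, None)
        if not (isinstance(i, int) and 0 <= i < len(self.items)) or self.items[i] is not None:
            raise TrailError("check i: item number out of range or already present")
        if p == SMALLEST:
            pred, succ = None, self.start
        elif self._node(p) is None:
            raise TrailError("trail names a predecessor that is not in the list")
        else:
            pred, succ = p, self.items[p]["next"]
        if pred is not None and not (self.items[pred]["x"], pred) < (x, i):
            raise TrailError("check ii: new pair is not greater than its predecessor")
        if succ is not None and not (x, i) < (self.items[succ]["x"], succ):
            raise TrailError("check iii: new pair is not less than its successor")
        self.items[i] = {"x": x, "prev": pred, "next": succ}
        if pred is None:
            self.start = i
        else:
            self.items[pred]["next"] = i
        if succ is not None:
            self.items[succ]["prev"] = i

    def delete(self, i):
        node = self._node(i)
        if node is None:
            raise TrailError("no element with this item number")
        if node["prev"] is None:
            self.start = node["next"]
        else:
            self.items[node["prev"]]["next"] = node["next"]
        if node["next"] is not None:
            self.items[node["next"]]["prev"] = node["prev"]
        self.items[i] = None

    def predecessor(self, i):
        node = self._node(i)
        if node is None:
            raise TrailError("no element with this item number")
        return SMALLEST if node["prev"] is None else node["prev"]

    def minimum(self):
        return self.start
```

Every operation of `SecondaryCollection` runs in constant time, since the trail supplies the insertion position that the first execution had to search for. An incorrect trail entry places the new pair between the wrong neighbours and is caught by check ii or check iii.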


Although our example uses elements that contain item numbers, it is
not necessary that the abstract data type be defined in this way. The
insert operation of an abstract data type may be modified to tag
elements with item numbers as they are inserted.

Variations on this scheme are possible. For example, by modifying
DELETE(i) and NEAREST(x) operations so that they also write the item
numbers of predecessors to the trail, it is possible to use a singly
linked list during the second execution. More sophisticated schemes,
involving marking list nodes for deletion and delayed checks, allow
the use of singly linked lists without requiring DELETE(i) and
NEAREST(x) to produce predecessor information.

The technique in this example generalizes to other abstract data
types supporting a predecessor operation. In fact, a somewhat weaker
condition often suffices; it is sufficient that the specific
implementation of the abstract data type allow the predecessor of an
element to be found at the time the element is inserted. The abstract
data type itself need not support a predecessor operation. This
technique is used in four of our example algorithms.

Using this technique, it is possible to reuse the first execution
code, except for the code implementing the abstract data type
operations. One advantage of this is that it may be possible to add
extra checking to such code, such as bounds checking and checks on
pointer references, that may be too expensive to include in the first
execution. Of course, the two programs may be developed separately as
long as the specifications agree on the use of the abstract data
type.

Space does not permit a full proof of correctness of this scheme. A
proof proceeds by establishing the following invariants on the
indexed linked list used in the second execution.

i. The pairs in the linked list are in order from smallest to
largest.

ii. Each element of the Items array is either NULL or points to one
of the nodes in the linked list.

iii. If Items[i] is not NULL, then the list node pointed to by it
stores an element with item number i. (Note that this implies that
each list node is pointed to at most once.)

iv. Every node in the list is pointed to by some entry Items[i].

v. Start is the item number of the first element in the list.

These conditions are clearly satisfied by an indexed 
linked list containing no elements (i.e., before any oper- 
ations have been performed). Inspection of operations 
that query the list (MIN and NEAREST for example) 
shows that they function correctly if the above condi- 
tions are met. It is easy to prove correctness of the 
certification trail by demonstrating that the operations 
maintain a one to one correspondence between the pairs
in the linked list and the elements in the abstract data 
type and that the above invariants are preserved. 

4.5 Shortest Path Example 

This is another classic problem which has been ex- 
amined extensively in the literature. Our approach is 
applied to a variant of the Dijkstra algorithm [3] as 
explicated in [17]. We are concerned with the single 
source problem, i.e., given a graph and a vertex s, find 
the shortest path from s to v for every vertex v. 

The algorithm for this problem which has the fastest 
asymptotic time complexity uses fusion trees and is 
given in [5]. This algorithm however appears to have 
a large constant of proportionality and therefore we do 
not use it. 

We use the techniques just discussed to implement 
the certification trail for this problem. A full description
may be found in a technical report [15].




Size           Basic    Prim. Exec.          Sec. Exec.   % Sav.   Speedup
(V, E)                  (Also Gen. Trail)
100, 1000      [?]      0.05                 [?]          [?]      [?]
[?]            0.15     0.16                 [?]          [?]      [?]
[?]            [?]      0.33                 0.11         [?]      2.82
[?]            [?]      0.76                 [?]          [?]      3.04
[?]            [?]      1.67                 0.45         [?]      3.51
2500, 25000    [?]      2.15                 0.55         [?]      3.75


Table 3: Shortest Path 


4.6 Huffman Tree Example 

This is another classic algorithmic problem and one of the original
solutions was found by Huffman [7]. It has been used extensively to
perform data compression through the design and use of so called
Huffman codes. These codes are prefix codes which are based on the
Huffman tree and which yield excellent data compression ratios. The
tree structure and the code design are based on the frequencies of
individual characters in the data to be compressed. Here we are
concerned exclusively with the Huffman tree. See [7] for information
about the coding application.

Definition 4.4 The Huffman tree problem is the following: Given a
sequence of frequencies (positive integers) f[1], f[2], ..., f[n],
construct a tree with n leaves and with one frequency value assigned
to each leaf so that the weighted path length is minimized.
Specifically, the tree should minimize the following sum:
$\sum_{l_i \in LEAF} len(i) \, f[i]$, where LEAF is the set of leaves,
len(i) is the length of the path from the root of the tree to the
leaf $l_i$, and f[i] is the frequency assigned to the leaf $l_i$.
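The quantity minimized in Definition 4.4 can be computed with Huffman's classical greedy merge, sketched below using Python's `heapq`. This only illustrates the problem statement: it is not the certification trail method of the technical report, and the function name is hypothetical. It relies on the fact that each merge of two subtrees adds one to the depth of every leaf beneath them, so the weighted path length equals the sum of all merged weights.

```python
import heapq


def huffman_weighted_path_length(freqs):
    """Return the minimum of sum(len(i) * f[i]) over trees with the given leaf weights."""
    if len(freqs) < 2:
        return 0                      # a single leaf sits at the root, depth 0
    heap = list(freqs)
    heapq.heapify(heap)
    total = 0
    while len(heap) > 1:
        a = heapq.heappop(heap)       # two smallest weights
        b = heapq.heappop(heap)
        total += a + b                # every leaf below this merge gets one level deeper
        heapq.heappush(heap, a + b)
    return total
```

For the frequencies `[1, 1, 2, 3, 5]` the leaves end up at depths 4, 4, 3, 2 and 1, giving 4 + 4 + 6 + 6 + 5 = 25, which is the value the function returns.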

A full description of the method we employ to gener- 
ate and use a certification trail is detailed in a technical 
report [15]. 


Size      Basic    Prim. Exec.          Sec. Exec.   % Sav.   Speedup
                   (Also Gen. Trail)
5000       0.81     0.87                0.16         [?]      5.06
10000      1.76     1.86                0.33         [?]      5.33
25000      [?]      [?]                 1.02         [?]      5.89
[?]        [?]     11.14                1.70         [?]      6.25


Table 4: Huffman tree 


4.7 Other problems 

We report timing data for five other problems, the 
“Manhattan skyline” problem, computation of Voronoi 
diagrams, longest increasing subsequence, the closest 
pair problem, and line segment intersection. Space per- 
mits only a brief description of these problems, rather 
than a full exposition of the certification trail tech- 
niques used. 

The “Manhattan skyline” problem is: Given a set 
of rectangles with collinear bottom edges, compute the 
polygonal outline of the union of the rectangles [9]. 

The Voronoi diagram is a fundamental concept in 
computational geometry [11]. Given a set of points P 
in the plane, the Voronoi diagram is a partition of the 
plane into regions such that each region consists of all 
points closer to a given $p \in P$ than to any other
point in P. Computation of the Voronoi diagram is
an important step in many problems involving point 
location. 

The next problem we consider is, given a sequence 
of integers, find the longest (not necessarily unique) 
strictly increasing subsequence. 
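For reference, a standard O(n log(n)) solution to the longest strictly increasing subsequence problem is sketched below. It is the textbook approach based on binary search over the smallest possible tail of each subsequence length, and is offered only as an assumption about what a first execution might look like; the paper does not describe its implementation or its trail for this problem here.

```python
import bisect


def longest_increasing_subsequence(seq):
    """Return one longest strictly increasing subsequence of seq."""
    tails = []       # tails[k]: smallest tail value of an increasing run of length k + 1
    tails_idx = []   # position in seq of that tail value
    parent = [None] * len(seq)
    for j, value in enumerate(seq):
        # bisect_left keeps the subsequence strictly increasing.
        k = bisect.bisect_left(tails, value)
        parent[j] = tails_idx[k - 1] if k > 0 else None
        if k == len(tails):
            tails.append(value)
            tails_idx.append(j)
        else:
            tails[k] = value
            tails_idx[k] = j
    # Walk the parent links back from the tail of the longest run.
    result = []
    j = tails_idx[-1] if tails_idx else None
    while j is not None:
        result.append(seq[j])
        j = parent[j]
    return result[::-1]
```

For the input `[3, 1, 4, 1, 5, 9, 2, 6]` the function returns `[1, 4, 5, 6]`, one of several subsequences of the maximum length 4.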


Size      Basic    Prim. Exec.          Sec. Exec.   % Sav.   Speedup
                   (Also Gen. Trail)
1000       0.27     0.26                0.12         29.63     2.25
5000       1.69     1.65                0.57         34.32     2.96
10000      3.91     3.72                1.14         37.85     3.43
15000      6.08     5.78                1.77         37.91     3.44
20000      8.53     8.27                2.33         37.87     3.66


Table 5: Skyline 


Size      Basic    Prim. Exec.          Sec. Exec.   % Sav.   Speedup
                   (Also Gen. Trail)
100        0.04     0.04                 0.03        12.50     1.33
500        0.24     0.26                 0.19         6.25     1.26
1000       0.51     0.51                 0.39        11.76     1.31
5000       2.75     2.82                 2.03        11.82     1.35
10000      5.79     5.89                 4.06        14.08     1.43
50000     40.15    40.63                22.00        22.00     1.83


Table 6: Voronoi Diagram 


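Size      Basic         Prim. Exec.          Sec. Exec.    % Sav.        Speedup
                        (Also Gen. Trail)
10000     (illegible)   0.14                 0.04          (illegible)   (illegible)
50000     (illegible)   0.81                 (illegible)   (illegible)   (illegible)
100000    (illegible)   1.70                 (illegible)   (illegible)   3.66
500000    (illegible)   9.32                 2.22          (illegible)   4.13
1000000   18.66         19.58                4.46          35.58         4.18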


Table 7: Longest Increasing Subsequence 


Given a set of points P in the plane, the Closest 
Pair problem is that of finding the pair of points with 
minimum distance over all pairs in the set. 


Size     Basic   Prim. Exec.          Sec. Exec.   % Sav.   Speedup
                 (Also Gen. Trail)
10000    0.26    0.27                 0.07         34.62    3.71
50000    1.45    1.55                 0.36         34.14    4.03
100000   3.06    3.26                 0.72         34.97    4.25
500000   16.84   18.02                3.62         35.75    4.65


Table 8: Closest Pair 


Given a set of line segments in the plane, the line 
intersection problem is the problem of determining all 
intersections of line segments in this set. 

For the first four problems, algorithms running in $O(n \log n)$ time
were implemented for the first execution. The second execution, using
certification trails, runs in linear time. The first execution
algorithm used for line intersection runs in $O((k + n) \log n)$ time,
where k is the number of intersections and n the number of points. The
second execution runs in $O(k + n)$ time. Note that k may be quadratic
in n.


Size          Basic   Prim. Exec.          Sec. Exec.   % Sav.   Speedup
                      (Also Gen. Trail)
(illegible)   0.47    0.49                 0.04         43.62    11.75
(illegible)   1.45    1.53                 0.12         43.10    12.08
(illegible)   3.33    3.47                 0.26         43.99    12.81
10000         7.72    7.88                 0.60         45.08    12.87
(illegible)   24.00   24.12                1.75         46.10    13.71


Table 9: Line Segment Intersection 


5 Concluding Discussion 

Certification trails have heretofore been discussed principally from a
theoretical perspective. In this paper we have presented experimental
timing data which illustrates the advantages of the certification trail
technique for software testing over the 2-version programming
technique. We have further presented techniques and analytical results
for several new algorithms which further support the significance of
the certification trail technique by demonstrating its broadening
applicability. It should be appreciated that the scope of our
experimental investigation is not limited to the algorithms considered
here; numerous other algorithms we have considered could have been
discussed, and we continue to work on new applications. It should also
be pointed out that in addition to the timing experiments reported
here, software fault injection experiments have also been conducted
which verify the detection capabilities of the certification trail
method. The breadth of applicability of the certification trail
technique continues to expand along with the credibility of its
advantages. Increasingly, the certification trail method can be viewed
as a competitive software testing alternative.


References 


[1] Avizienis, A., “The N-version approach to fault tolerant
software,” IEEE Trans. on Software Engineering, vol. 11,
pp. 1491-1501, Dec., 1985.

[2] Chen, L., and Avizienis, A., “N-version programming: a fault
tolerant approach to reliability of software operation,” 1978 Fault
Tol. Comp. Symp., pp. 3-9, IEEE Computer Society Press, 1978.

[3] Dijkstra, E. W., “A note on two problems in connexion with
graphs,” Numer. Math. 1, pp. 269-271, 1959.

[4] Fortune, S., “A Sweepline Algorithm for Voronoi Diagrams,”
Algorithmica, pp. 153-174, 2, 1987.

[5] Fredman, M. L., and Willard, D. E., “Trans-dichotomous algorithms
for minimum spanning trees and shortest paths,” Proc. 31st IEEE
Foundations of Computer Science, pp. 719-725, 1990.

[6] Graham, R. L., “An efficient algorithm for determining the convex
hull of a planar set,” Information Processing Letters, pp. 132-133,
1, 1972.

[7] Huffman, D., “A method for the construction of minimum redundancy
codes,” Proc. IRE, pp. 1098-1101, 40, 1952.

[8] Johnson, B., Design and analysis of fault tolerant digital
systems, Addison-Wesley, Reading, MA, 1989.

[9] Manber, U., Introduction to Algorithms, Addison-Wesley, Reading,
MA, 1989.

[10] Nievergelt, J., and Hinrichs, K. H., Algorithms and Data
Structures With Applications to Graphics and Geometry, Prentice Hall,
NJ, 1993.

[11] Preparata, F. P., and Shamos, M. I., Computational geometry,
Springer-Verlag, New York, NY, 1985.

[12] Sedgewick, R., “Implementing quicksort programs,” Comm. of the
ACM, pp. 847-857, 21(10), 1978.

[13] Siewiorek, D., and Swarz, R., The theory and practice of reliable
design, Digital Press, Bedford, MA, 1982.

[14] Sullivan, G.F., and Masson, G.M., “Using certification trails to
achieve software fault tolerance,” Digest of the 1990 Fault Tolerant
Computing Symposium, pp. 423-431, IEEE Computer Society Press, 1990.

[15] Sullivan, G.F., and Masson, G.M., “Using certification trails to
achieve software fault tolerance,” Department of Computer Science
Technical Report JHU 89/26, Johns Hopkins University, Baltimore,
Maryland, 1989.

[16] Sullivan, G.F., and Masson, G.M., “Certification trails for data
structures,” Digest of the 1991 Fault Tolerant Computing Symposium,
pp. 240-247, IEEE Computer Society Press, 1991.

[17] Tarjan, R. E., Data Structures and Network Algorithms, Society
for Industrial and Applied Mathematics, Philadelphia, PA, 1983.





N94- 36067 







Experimental Evaluation of Certification Trails using Abstract Data
Type Validation


Dwight S. Wilson*
Dept. of Computer Science
Johns Hopkins University
Baltimore, MD 21218


Gregory F. Sullivan†
Dept. of Computer Science
Johns Hopkins University
Baltimore, MD 21218


Gerald M. Masson‡
Dept. of Computer Science
Johns Hopkins University
Baltimore, MD 21218


Abstract 

Certification trails are a recently introduced and promising approach
to fault-detection and fault-tolerance [11, 12]. Recent experimental
work [13] reveals many cases in which a certification-trail approach
allows for significantly faster program execution time than a basic
time-redundancy approach. Algorithms for answer-validation of abstract
data types are presented in [12] and allow a certification trail
approach to be used for a wide variety of problems. In this paper, we
report on an attempt to assess the performance of algorithms utilizing
certification trails on abstract data types. Specifically we have
applied this method to the following problems: heapsort, Huffman tree,
shortest path, and skyline. Previous results used certification trails
specific to a particular problem and implementation. The approach in
this paper allows certification trails to be localized to “data
structure modules,” making the use of this technique transparent to
the user of such modules.

Keywords: Software fault tolerance, certification trails, error
monitoring, design diversity, data structures.


1 Introduction 

To explain the essence of the certification trail technique for
software fault tolerance, we first discuss 2-version programming
[4, 2]. Using 2-version (or more generally, N-version) programming,
two (or N) implementations of an algorithm are executed on a given
input, and the results compared. If the outputs agree, they are
accepted; otherwise an error is flagged. This technique will detect a
variety of software faults as well as transient hardware faults. A
variation of this technique is to execute a single program twice and
compare results; this is called time redundancy. Although there are a
few software faults that may be detected using time redundancy (e.g.,
uninitialized pointer errors), it is more effective in catching
transient faults.

* Research partially supported by NSF Grant CCR-8910569 and IBM
Technology Interchange Program Grant.

† Research partially supported by NSF Grants CCR-8910569 and
CCR-8908092.

‡ Research partially supported by NASA Grant NSG 1442.

The certification trail technique is designed to achieve similar types
of error detection capabilities but expend fewer resources. The
central idea is to modify the first algorithm so that it leaves behind
a trail of data which we call a certification trail. The second
algorithm may then make use of this data, which is chosen so that the
algorithm executes more quickly and/or has a simpler structure than
the first algorithm. As above, the outputs of the two executions are
compared and are considered correct only if they agree. Note, however,
that we must be careful in defining this method or else its error
detection capability might be reduced by the introduction of data
dependency between the two algorithm executions. For example, suppose
the first algorithm execution contains an error which causes an
incorrect output and an incorrect certification trail of data to be
generated. Further suppose that no error occurs during the execution
of the second algorithm. It appears possible that the execution of the
second algorithm might use the incorrect trail to generate an
incorrect output which matches the incorrect output given by the
execution of the first algorithm. Intuitively, the second execution
would be “fooled” by the data left behind by the first execution. The
definitions we give below exclude this possibility. They demand that
the second execution either generates a correct answer or signals the
fact that an error has been detected in the data trail.

Early work on the certification trail focused on creating trails for
specific implementations of problems. For example, the trail given in
[11] for the convex hull problem is specific to the Graham scan
algorithm. In general, the two algorithms used in this approach can be
quite different. A more recent approach is to construct a
certification trail for an abstract data type. That is, given the
answers to operations allowed on that type, our algorithm checks the
correctness of these answers. This method has the advantage that the
certification trail techniques are localized to the routines
implementing data structure operations, and may then be applied to a
wide variety of problems without special coding. In many cases it may
be possible to use existing code with only minor modifications. Code
using these routines is run twice, the first time generating the
trail, the second time using it. Alternately, the trail checking may
be done in parallel, i.e., we perform the checking as the trail is
being generated. A programmer using a library of these routines need
not be familiar with certification trail techniques. Object oriented
programming techniques may be particularly useful for implementation
of such “certified” data types.

0730-3157/92 $3.00 © 1992 IEEE

2 Formal Definition of a Certification Trail

In this section we will give a formal definition of a certification
trail and discuss some aspects of its realizations and uses.

Definition 2.1 A problem P is formalized as a relation, i.e., a set of
ordered pairs. Let D be the domain (that is, the set of inputs) of the
relation P and let S be the range (that is, the set of solutions) for
the problem. We say an algorithm A solves a problem P iff for all
$d \in D$, when d is input to A then an $s \in S$ is output such that
$(d, s) \in P$.

Definition 2.2 Let $P : D \to S$ be a problem. A solution to this
problem using a certification trail consists of two functions $F_1$
and $F_2$ with the following domains and ranges:
$F_1 : D \to S \times T$ and $F_2 : D \times T \to S \cup \{error\}$.
T is the set of certification trails. The functions must satisfy the
following two properties:

(1) for all $d \in D$ there exists $s \in S$ and there exists
    $t \in T$ such that
    $F_1(d) = (s, t)$ and $F_2(d, t) = s$ and $(d, s) \in P$

(2) for all $d \in D$ and for all $t \in T$
    either ($F_2(d, t) = s$ and $(d, s) \in P$)
    or $F_2(d, t) = error$.

We also require that $F_1$ and $F_2$ be implemented so that they map
elements which are not in their respective domains to the error
symbol. The definitions above assure that the error detection
capability of the certification trail approach is comparable to that
obtained with the simple time redundancy approach discussed earlier.
(That is, if transient hardware faults occur during only one of the
executions then either an error will be detected or the output will be
correct.) It should be further noted, however, that the examples to be
considered indicate that this new approach can also save overall
execution time.
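To make the two properties of Definition 2.2 concrete, the following minimal sketch uses sorting as the problem P and a sorting permutation as the trail. This example is an illustration only and is not taken from the paper; the names `f1`, `f2`, and `certified_sort` are hypothetical stand-ins for $F_1$, $F_2$, and the final comparison step.

```python
def f1(data):
    """First execution: sort `data` and leave behind a trail.

    The trail is the permutation of indices that puts the data in order.
    """
    trail = sorted(range(len(data)), key=lambda i: data[i])
    output = [data[i] for i in trail]
    return output, trail


def f2(data, trail):
    """Second execution: rebuild the sorted output in linear time.

    Returns the sorted list, or "error" if the trail is not a permutation
    of the indices or does not arrange the data in nondecreasing order.
    Any trail, correct or corrupted, leads to one of these two outcomes.
    """
    n = len(data)
    if len(trail) != n:
        return "error"
    seen = [False] * n
    for i in trail:
        if not isinstance(i, int) or not 0 <= i < n or seen[i]:
            return "error"
        seen[i] = True
    output = [data[i] for i in trail]
    if any(output[j] > output[j + 1] for j in range(n - 1)):
        return "error"
    return output


def certified_sort(data):
    """Run both executions and accept the result only if they agree."""
    out1, trail = f1(data)
    out2 = f2(data, trail)
    if out2 == "error" or out1 != out2:
        return "error"
    return out1
```

Because `f2` checks that the trail is a permutation and that the resulting sequence is ordered, a corrupted trail cannot "fool" it into returning a wrong answer, which is the situation property (2) is meant to exclude.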


3 Answer Validation Problem for Abstract Data Types

Our general approach to applying certification trails uses the concept
of an abstract data type. Some examples of abstract data types are
given later in this paper. Here we mention some important common
properties and give a short illustration. Each abstract data type has
a well defined data object or set of data objects. Each abstract data
type has a carefully defined finite collection of operations that can
be performed on its data object(s). Each operation takes a finite
number of arguments (possibly zero). In addition, some but not all
operations return answers. An example of an abstract data type is a
priority queue.

The data object for a priority queue is an ordered pair of the form
(i, k) where i is an item number and k is a key value. A priority
queue has two operations: insert(i, k) and delmin. The insert
operation has two arguments: item number i and key value k. The insert
operation does not return an answer. The delmin operation has no
arguments, but it does return an answer. The precise semantics of
these operations are given later in this paper.

For each abstract data type we may define an an- 
swer validation problem. Intuitively, the answer vali- 
dation problem consists of checking the correctness of 
a sequence of supposed answers to a sequence of op- 
erations performed on the abstract data type. More 
formally, the input to the answer validation problem 
is a sequence of operations on the abstract data type 
together with the arguments of each operation. In 
addition, the sequence contains the supposed answers 
for each of the operations which return answers. In 
particular, each supposed answer is paired with the 
operation that is supposed to return it. 

The output for the answer validation problem is the word “correct” if
the answers given in the input match the answers that would be
generated by actually performing the operations. The output is the
word “incorrect” if the answers do not match. It is also useful to
allow the output word to say “ill-formed.” This output is used if the
sequence of operations is ill-formed, e.g., an operation has too many
arguments or an argument refers to an inappropriate object.

The answer validation problem is similar to the idea of an acceptance
test which is used in the recovery block approach [10] to software
fault tolerance. The main difference is that an answer validation
problem is dependent upon a sequence of answers, not just an
individual answer. Hence, if an incorrect answer appears in the
sequence, it may not be detected immediately. It is guaranteed,
however, that an incorrect answer will be detected at some point
during the processing of the entire sequence. By allowing for this
latency in detection, it is possible to create a much more efficient
procedure for solving the answer validation problem.

The most important aspect of the answer validation problem is the fact
that it is often possible to check the correctness of the answers to a
sequence of operations much more quickly than actually calculating
what the answers should be from scratch. In other words, the answer
validation problem has a smaller time complexity than the original
abstract data type problem. For example, to calculate the answers to a
sequence of n priority queue operations takes $\Omega(n \log n)$ time
in the decision tree model; however, it is possible to check the
correctness of the answers in only $O(n)$ time [12]. This speed is
very useful in fault-detection applications.

It is possible to run an answer validation algorithm for some abstract
data type concurrently with some algorithm which uses the abstract
data type. The answer validation algorithm could act as a monitor
making sure that all interactions with the abstract data type are
handled correctly. This is valuable because many algorithms spend a
large fraction of their time operating on abstract data types. Note
that the overhead of this monitor is less than the overhead of
actually performing the data type operations twice.


4 Schema for using Certification Trails 

Suppose that we have developed an efficient solution to the answer
validation problem for some abstract data type. By efficient we mean
the time complexity of the answer validation problem is smaller than
the time complexity of the original abstract data type problem.
Further, suppose that we wish to run an algorithm, say A, which uses
that abstract data type. To apply the certification trail method we
can use the following schema to yield the two executions:

First Execution:

Execute algorithm A.

Each time an abstract data type operation is performed, append to the
certification trail the identity of the operation, the arguments, and
the answer.


Second Execution:

Phase One:

Validate the correctness of the operations and supposed answers given
in the certification trail. If the validation returns “incorrect” or
“ill-formed” then output “error” and stop. Otherwise, continue.

Phase Two:

Execute algorithm A.

Each time an abstract data type operation is performed, read the next
entry in the certification trail. Make sure that the operation and the
arguments in the certification trail agree with those requested in the
algorithm. If not, output “error” and stop. Otherwise, use the answer
given in the certification trail and continue.
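The schema can be sketched in code for the priority queue with insert and delmin. This is only an illustrative sketch under stated assumptions: the names `CertifiedPQ`, `validate`, `TrailError`, and `algorithm_a` are hypothetical, and `validate` is a naive stand-in that simply redoes the operations on a heap. It is not the linear-time validation algorithm of [12], so it shows the control flow of the two executions but not the speedup.

```python
import heapq


class TrailError(Exception):
    """Signals that the second execution outputs "error" and stops."""


def validate(trail):
    """Phase One stand-in: check supposed answers by redoing the operations.

    Assumption: this naive checker is NOT the fast validator of [12]; it is
    here only so that the schema can be run end to end.
    """
    heap = []
    for entry in trail:
        if not (isinstance(entry, tuple) and len(entry) == 3):
            return "ill-formed"
        op, args, answer = entry
        if op == "insert" and isinstance(args, tuple) and len(args) == 2:
            item, key = args
            heapq.heappush(heap, (key, item))
            expected = None                      # insert returns no answer
        elif op == "delmin" and args == ():
            if heap:
                key, item = heapq.heappop(heap)
                expected = (item, key)
            else:
                expected = "empty"
        else:
            return "ill-formed"
        if answer != expected:
            return "incorrect"
    return "correct"


class CertifiedPQ:
    """Priority queue that generates a trail ("generate") or uses one ("use")."""

    def __init__(self, mode, trail=None):
        self.mode = mode
        self.trail = [] if trail is None else trail
        self.heap = []
        self.pos = 0
        if mode == "use" and validate(self.trail) != "correct":
            raise TrailError("trail rejected in Phase One")

    def _step(self, op, args, compute):
        if self.mode == "generate":              # First Execution
            answer = compute()
            self.trail.append((op, args, answer))
            return answer
        # Second Execution, Phase Two: compare the request with the trail.
        if self.pos >= len(self.trail) or self.trail[self.pos][:2] != (op, args):
            raise TrailError("operation does not match the trail")
        answer = self.trail[self.pos][2]
        self.pos += 1
        return answer

    def _pop(self):
        if not self.heap:
            return "empty"
        key, item = heapq.heappop(self.heap)
        return (item, key)

    def insert(self, item, key):
        return self._step("insert", (item, key),
                          lambda: heapq.heappush(self.heap, (key, item)))

    def delmin(self):
        return self._step("delmin", (), self._pop)


def algorithm_a(values, pq):
    """Example algorithm A: sort by inserting everything, then draining."""
    for item, key in enumerate(values):
        pq.insert(item, key)
    return [pq.delmin()[1] for _ in values]


data = [5, 3, 8, 1]
first = CertifiedPQ("generate")
out1 = algorithm_a(data, first)
second = CertifiedPQ("use", first.trail)
out2 = algorithm_a(data, second)
result = out1 if out1 == out2 else "error"
```

Note that `algorithm_a` is identical in both executions; only the mode passed to the data structure changes, which is what allows the trail handling to stay inside the data structure module.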

This schema can yield execution times which are significantly faster
than the execution time obtained by running algorithm A twice. Yet the
schemes yield comparable fault detection capabilities. Note that the
first execution can be slower than a simple execution of algorithm A
since it must output a certification trail. However, the second
execution can be significantly faster than a simple execution of the
algorithm since the interactions with the abstract data type take less
time overall. The net effect can yield a major speedup.

Suppose an algorithm uses multiple abstract data types and suppose
there are efficient answer validation algorithms for each of these
abstract data types. It is easy to see how our method generalizes. We
can leave behind a generalized certification trail which consists of a
separate certification trail for each of the abstract data types. The
effect on the speedup of the second execution will be cumulative.


5 Generalized Priority Queue 

We now describe a somewhat general abstract data type. We are able to
solve the answer validation problem for restricted versions of this
data type. The data consists of a set of ordered pairs. The first
element in these ordered pairs is referred to as the item number and
the second element is called the key value. Ordered pairs may be added
to and removed from the set; however, at all times the item numbers of
distinct ordered pairs must be distinct. It is possible, though, for
multiple ordered pairs to have the same key value. In this paper the
item numbers are integers between 1 and n, inclusive. Our default
convention is that i is an item number, k is a key value, and the data
object is a set of ordered pairs. A total ordering on the pairs of a
set can be defined lexicographically as follows: $(i, k) < (i', k')$
iff $k < k'$ or ($k = k'$ and $i < i'$). The abstract data types we
will consider support a subset of the following operations.

member(i) returns a boolean value of true if the set contains an
    ordered pair with item number i, otherwise returns false.

insert(i, k) adds the ordered pair (i, k) to the set. We require that
    no other pair with item number i be in the set.

delete(i) deletes the unique ordered pair with item number i from the
    set. We require that a pair with item number i be in the set
    initially.

changekey(i, k) is executed only when there is an ordered pair with
    item number i in the set. This pair is replaced by (i, k).

deletemin returns the ordered pair which is smallest according to the
    total order defined above and deletes this pair. If the set is
    empty then the token “empty” is returned.

min returns the ordered pair which is smallest according to the total
    order defined above. If the set is empty then the token “empty” is
    returned.

max and deletemax: these operations are similar to min and deletemin,
    using the largest element instead of the smallest one.

If an operation violates one of the requirements described above then
it is considered to be ill-formed. Also, if an operation has the wrong
number or type of arguments it is considered to be ill-formed.
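The semantics of these operations, including the lexicographic order and the ill-formed cases, can be pinned down with a small reference model. The sketch below is an assumption-laden illustration, not the authors' implementation: the class name `GeneralizedPQ` and the exception `IllFormed` are hypothetical, the model favors clarity over speed (min and max scan the whole set), and it does not enforce that item numbers lie between 1 and n.

```python
class IllFormed(Exception):
    """Raised when an operation violates one of the stated requirements."""


class GeneralizedPQ:
    """Reference model of the generalized priority queue operations."""

    def __init__(self):
        self.key_of = {}                     # item number -> key value

    def _order(self, i):
        # Lexicographic order on pairs: compare key values, then item numbers.
        return (self.key_of[i], i)

    def member(self, i):
        return i in self.key_of

    def insert(self, i, k):
        if i in self.key_of:
            raise IllFormed("item number already in the set")
        self.key_of[i] = k

    def delete(self, i):
        if i not in self.key_of:
            raise IllFormed("item number not in the set")
        del self.key_of[i]

    def changekey(self, i, k):
        if i not in self.key_of:
            raise IllFormed("item number not in the set")
        self.key_of[i] = k

    def min(self):
        if not self.key_of:
            return "empty"
        i = min(self.key_of, key=self._order)
        return (i, self.key_of[i])

    def max(self):
        if not self.key_of:
            return "empty"
        i = max(self.key_of, key=self._order)
        return (i, self.key_of[i])

    def deletemin(self):
        answer = self.min()
        if answer != "empty":
            del self.key_of[answer[0]]
        return answer

    def deletemax(self):
        answer = self.max()
        if answer != "empty":
            del self.key_of[answer[0]]
        return answer
```

A model of this kind is useful mainly as a specification against which faster implementations, or an answer validator, can be checked.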

Many different types and combinations of data structures can be used
to support different subsets of these operations efficiently.
Specifically we are interested in allowing the insert, delete, min,
and deletemin operations. It is possible to process a sequence of
$O(n)$ operations in $O(n \log n)$ time with implementations using
heaps or balanced search trees such as AVL trees [1], red-black trees
[6] or B-trees [3]. Answer validation of these operations can be
performed in $O(n)$ time [12, 13].

6 Examples of the Use of Data Structure Certification

In this section we evaluate the use of certification trails for data
structures as applied to four well-known and significant problems in
computer science: sorting, the shortest path tree problem, the Huffman
tree problem, and the skyline problem. We have implemented basic
algorithms for these problems and algorithms which generate and use
certification trails. Timing data was collected using a SPARCstation
ELC.

The timing information reported in the tables con- 
sists of the run time of the basic algorithm (i.e., no 
certification trail), the run time of the trail-generating 
algorithm, the run time of the trail-using algorithm, 
the percentage savings of using certification trails, and 
the speedup achieved by the second phase of the certi- 
fication trail method. The percentage savings is com- 
puted by comparing the total run time of algorithms 
for generating and using trails against twice the run 
time of the basic algorithm. The speedup is computed 
by dividing the run time of the basic algorithm by the 
run time of the algorithm that uses the certification 
trail. 
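The two derived columns can be computed as in the short sketch below. The function names are hypothetical, and the timings used in the example (0.27, 0.26, and 0.12 seconds) are illustrative inputs chosen only to show the arithmetic.

```python
def percent_savings(basic, generate, use):
    """Savings of (generate + use) relative to running the basic algorithm twice."""
    return 100.0 * (1.0 - (generate + use) / (2.0 * basic))


def speedup(basic, use):
    """Speedup of the trail-using execution over the basic algorithm."""
    return basic / use


# Illustrative timings in seconds: basic, trail-generating, trail-using.
savings = round(percent_savings(0.27, 0.26, 0.12), 2)   # 29.63
ratio = round(speedup(0.27, 0.12), 2)                    # 2.25
```

Here the two certification trail executions take 0.38 seconds in total against 0.54 seconds for two basic executions, which gives the 29.63 percent savings.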

Apart from the data structures, the implementation of both phases of
the certification trail version of each algorithm is nearly identical
to the implementation of the basic version. The only difference in the
code for the two phases is a parameter passed to the data structure
code indicating whether a certification trail should be generated or
used. All code implementing the certification trails is localized to
the modules implementing the data structures, allowing the generation
and use of the trail to be transparent to the user of these modules.
Due to space constraints only an abbreviated discussion of the
algorithms is given.

6.1 Heapsort 

Sorting is a fundamental operation in computer systems, and there
exist several sorting algorithms. Sorting may be implemented with a
priority queue (or more specifically, a heap) by inserting all
elements and performing deletemin operations until the queue is empty.

Input data was generated by creating sets of integers chosen uniformly
from the interval [0, 10000000). Timing results are based on fifty
executions at each input size.
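A minimal sketch of sorting through a priority queue, as described above, is given here using Python's standard `heapq` module as the heap. The function name `pq_sort` and the use of a seeded `random.Random` to draw test data are assumptions made for illustration; they are not the authors' code or their data generator.

```python
import heapq
import random


def pq_sort(values):
    """Sort by inserting every element into a heap, then draining it."""
    heap = []
    for v in values:
        heapq.heappush(heap, v)              # insert
    out = []
    while heap:                              # until the queue is empty
        out.append(heapq.heappop(heap))      # deletemin
    return out


# Integers drawn uniformly from [0, 10000000), as in the experiments.
rng = random.Random(0)
sample = [rng.randrange(10_000_000) for _ in range(1000)]
sorted_sample = pq_sort(sample)
```

Every interaction with the heap is either an insert or a deletemin, so this is a case where the whole running time is spent in operations covered by the answer validation approach.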

Size          Basic         Generate      Use           % Saving   Speedup
              Algorithm     Trail         Trail
(illegible)   0.44          0.41          (illegible)   ??.??      4.00
(illegible)   (illegible)   1.00          (illegible)   ?7.24      4.2?
(illegible)   ?.71          ?.?0          (illegible)   ?7.27      4.?2
100000        ?.?7          (illegible)   (illegible)   ?7.??      4.77
(illegible)   (illegible)   (illegible)   2.47          ?9.?0      (illegible)
(illegible)   (illegible)   ?0.2?         (illegible)   ??.04      ?.27

Table 1: Heapsort


6.2 Huffman Tree

Given a sequence of frequencies (positive integers), we wish to
construct a Huffman tree, i.e., a binary tree with frequencies
assigned to the leaves, such that the sum of the weighted path lengths
is minimized. This is a classic algorithmic problem and one of the
original solutions was found by Huffman [7]. It has been used
extensively in data compression algorithms through the design and use
of so-called Huffman codes. The tree structure and code design are
based on frequencies of individual characters in the data to be
compressed. In this paper we are concerned only with the Huffman tree;
the interested reader should consult [7] for information about the
coding application.


Size          Basic         Generate      Use           % Saving      Speedup
              Algorithm     Trail         Trail
?000          0.??          0.41          (illegible)   27.??         2.71
10000         0.??          0.?7          (illegible)   ?0.12         2.??
20000         1.7?          1.?0          (illegible)   20.80         2.0?
(illegible)   4.51          ?.?0          (illegible)   (illegible)   ?.22
(illegible)   10.71         11.47         (illegible)   ?2.14         ?.4?
(illegible)   (illegible)   17.8?         (illegible)   ?2.?4         ?.?0

Table 2: Huffman Tree

The Huffman tree is built from the bottom up and the overall structure
of the algorithm is based on the greedy “merging” of subtrees. An
array of pointers, ptr, is used to point to the subtrees as they are
constructed. Initially, n single vertex subtrees are constructed, each
one associated with a frequency number in the input. The algorithm
repeatedly merges the two subtrees with the smallest associated
frequency values, assigning the sum of these frequencies to the
resulting tree. A priority queue data structure allows the algorithm
to quickly find the subtrees to merge at each step.

Data for the timing experiments was generated by choosing integer
frequencies uniformly from the range [0, 100000]. Timing results are
based on fifty executions for each input size.
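The greedy merging step can be sketched with a binary heap as the priority queue. This is a hedged illustration rather than the authors' implementation: the function names are hypothetical, leaves are represented by their input index and internal vertices by `(left, right)` tuples instead of the ptr array mentioned above, and a counter is added to heap entries purely to break ties between equal frequencies. The input is assumed to be nonempty.

```python
import heapq
from itertools import count


def huffman_tree(freqs):
    """Build a Huffman tree for a nonempty list of frequencies.

    Returns (tree, total), where a leaf is the index of its frequency, an
    internal vertex is a (left, right) tuple, and total is the sum of the
    merged frequencies, which equals the weighted path length.
    """
    tiebreak = count()
    heap = [(f, next(tiebreak), i) for i, f in enumerate(freqs)]
    heapq.heapify(heap)
    total = 0
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)      # smallest frequency
        f2, _, t2 = heapq.heappop(heap)      # second smallest frequency
        total += f1 + f2
        heapq.heappush(heap, (f1 + f2, next(tiebreak), (t1, t2)))
    return heap[0][2], total


def weighted_path_length(tree, freqs, depth=0):
    """Sum over the leaves of (depth of leaf) * (frequency of leaf)."""
    if isinstance(tree, int):
        return depth * freqs[tree]
    left, right = tree
    return (weighted_path_length(left, freqs, depth + 1)
            + weighted_path_length(right, freqs, depth + 1))


freqs = [5, 9, 12, 13, 16, 45]
tree, total = huffman_tree(freqs)
check = weighted_path_length(tree, freqs)
```

For these six frequencies the merges produce subtrees of weight 14, 25, 30, 55, and 100, so both `total` and `check` come out to 224.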

6.3 Shortest Path

Given a graph with non-negative edge weights and a source vertex, we
wish to find the shortest paths from the source vertex to each of the
other vertices. This is another classic problem and has been examined
extensively in the literature. Our approach is applied to Dijkstra's
algorithm.

Dijkstra's algorithm is a greedy algorithm. At each step, there exists
a set of vertices S to which shortest paths are known, and a set T of
vertices adjacent to members of this set. The best paths known to
members of T are examined, and the vertex v with the minimum path
length is removed from T and added to S. A data structure that
supports insert, delete, and deletemin can be used to implement this
algorithm.


Size           Basic       Generate   Use           % Saving      Speedup
               Algorithm   Trail      Trail
(illegible)    0.1?        0.14       (illegible)   ??.??         (illegible)
(illegible)    0.??        0.?2       (illegible)   (illegible)   (illegible)
(illegible)    0.??        0.?2       (illegible)   ??.??         (illegible)
1000, 10000    0.70        0.7?       (illegible)   ?7.07         (illegible)
2000, 20000    1.74        1.??       (illegible)   ?7.84         (illegible)
2?00, 2?000    2.22        2.08       (illegible)   ?8.??         (illegible)

Table 3: Shortest Path

Input graphs of |V| vertices and |E| edges were generated by choosing
a set of |E| distinct edges uniformly from all possible such sets,
then rejecting graphs that were not connected. |E| was chosen
sufficiently large that each selection is connected with high
probability, resulting in few rejections. The input sizes were chosen
to keep the ratio |E|/|V| constant, for in practice the running time
of the algorithm is affected by this ratio. Timing results are based
on fifty executions at each input size. The size column of Table 3
contains an ordered pair indicating the number of vertices and edges.
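Dijkstra's algorithm as used in this section can be sketched with a binary heap. The sketch below is an assumption, not the authors' code: the function name and the adjacency-list format are hypothetical, and because Python's `heapq` offers no delete operation, outdated heap entries are skipped when popped ("lazy deletion") instead of being removed with delete as the text describes.

```python
import heapq


def dijkstra(graph, source):
    """Shortest path lengths from `source`.

    `graph` maps each vertex to a list of (neighbor, weight) pairs with
    non-negative weights. Returns a dict of distances for reachable vertices.
    """
    dist = {source: 0}
    done = set()                         # the set S of finished vertices
    heap = [(0, source)]                 # best known paths to members of T
    while heap:
        d, v = heapq.heappop(heap)       # deletemin
        if v in done:
            continue                     # outdated entry, stands in for delete
        done.add(v)
        for w, weight in graph.get(v, []):
            nd = d + weight
            if w not in dist or nd < dist[w]:
                dist[w] = nd
                heapq.heappush(heap, (nd, w))   # insert
    return dist


graph = {
    "a": [("b", 4), ("c", 1)],
    "b": [("d", 1)],
    "c": [("b", 2), ("d", 7)],
    "d": [],
}
distances = dijkstra(graph, "a")
```

In this small graph the path a, c, b, d of length 4 beats both the direct edge to b and the edge from c to d.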


6.4 Skyline 

Given a set of rectangles with collinear bottom edges, the skyline is
the figure resulting from removing all hidden edges. The problem of
computing the skyline of a set of rectangular buildings by eliminating
hidden lines is discussed in [8]. The method used there is divide and
conquer and it constructs a skyline in $O(n \log n)$ time. In this
paper we use a plane sweep algorithm that can be easily implemented in
terms of operations on priority queues. Plane sweep algorithms are
widely used for computational geometry problems [9]; they typically
use a priority queue for event scheduling and may be amenable to use
of certification trail techniques.

Using a plane sweep algorithm, we compute the skyline as follows.
Initialize a vertical sweep line to the left of all the rectangles (we
may assume that all rectangles are to the right of the y-axis). As we
sweep the line to the right we maintain a collection of the heights of
the rectangles encountered. For each rectangle R, the height of R is
added to the collection when we encounter R's left edge and removed
when we encounter its right edge. The height of the skyline at any
point $x_0$ is the maximum height in the collection when the sweepline
is at $x = x_0$. Details are given below. A structure supporting
insert and deletemin is all that is needed to order the events, and a
structure supporting insert, max, and delete is required to store the
rectangle heights. A priority queue (supporting insert and deletemin)
can be used to order the sweepline events, and a generalized priority
queue to store the rectangle heights.

Input data was generated by choosing integral rectangle heights
uniformly over the range [0, 100000]. The x-coordinates of the left
edges were chosen uniformly over the range [0, 90000] and the width of
each rectangle was chosen uniformly over the range [1, 10000]. Timing
results are based on twenty executions for each input size.
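The plane sweep described in this section can be sketched as follows. This is an illustrative sketch under assumptions, not the authors' implementation: the function name and the `(left, right, height)` rectangle format are hypothetical, a binary heap orders the sweepline events, and a second heap with negated heights plays the role of the structure supporting insert, max, and delete, with expired heights discarded lazily when they reach the top instead of being deleted at the right edge.

```python
import heapq


def skyline(rects):
    """Skyline of rectangles standing on a common base line.

    `rects` is a list of (left, right, height) with right > left and
    height > 0. Returns the (x, height) points at which the skyline
    height changes, from left to right.
    """
    events = []
    for left, right, h in rects:
        events.append((left, -h, right))   # left edge; taller rectangles first
        events.append((right, 0, 0))       # right edge
    heapq.heapify(events)                  # priority queue of sweepline events

    live = [(0, float("inf"))]             # (-height, right edge); ground level
    result = []
    while events:
        x, neg_h, right = heapq.heappop(events)      # deletemin on events
        while live[0][1] <= x:                       # drop expired maximum
            heapq.heappop(live)
        if neg_h:
            heapq.heappush(live, (neg_h, right))     # insert a height
        height = -live[0][0]                         # max of the collection
        if not result or result[-1][1] != height:
            result.append((x, height))
    return result


outline = skyline([(2, 9, 10), (3, 7, 15), (5, 12, 12),
                   (15, 20, 10), (19, 24, 8)])
```

The returned list gives the left end of each horizontal piece of the outline; a height of 0 marks a gap between groups of rectangles or the end of the skyline.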


7 Conclusions 

The experimental data in this paper shows the utility of the
certification trail approach using abstract data types. This paper
supplements [13], which provides experimental data illustrating the
advantages of implementation specific certification trails over
classical time redundancy. We have shown that the more general
approach of checking abstract data types also provides performance
superior to classical time redundancy. This is significant because a
wide range of algorithms may be represented as a sequence of
operations on abstract data types. The certification trail approach
may therefore be used on these programs, without requiring per problem
“ad hoc” techniques. Creation of library routines or class libraries
for these data types allows the certification trail technique to be
used transparently, and may allow its use with only minor
modifications of existing code.


References

[1] Adel'son-Vel'skii, G. M., and Landis, E. M., “An algorithm for the
organization of information,” Soviet Math. Dokl., pp. 1259-1262, 3,
1962.

[2] Avizienis, A., “The N-version approach to fault tolerant
software,” IEEE Trans. on Software Engineering, vol. 11,
pp. 1491-1501, Dec., 1985.

[3] Bayer, R., and McCreight, E., “Organization of large ordered
indexes,” Acta Inform., pp. 173-189, 1, 1972.

[4] Chen, L., and Avizienis, A., “N-version programming: a fault
tolerant approach to reliability of software operation,” Digest of the
1978 Fault Tolerant Computing Symposium, pp. 3-9, IEEE Computer
Society Press, 1978.

[5] Gabow, H. N., and Tarjan, R. E., “A linear-time algorithm for a
special case of disjoint set union,” J. of Comp. and Sys. Sci., 30(2),
pp. 209-221, 1985.

[6] Guibas, L. J., and Sedgewick, R., “A dichromatic framework for
balanced trees,” Proceedings of the Nineteenth Annual Symposium on
Foundations of Computing, pp. 8-21, IEEE Computer Society Press, 1978.

[7] Huffman, D., “A method for the construction of minimum redundancy
codes,” Proc. IRE, pp. 1098-1101, 40, 1952.

[8] Manber, U., Introduction to Algorithms: A Creative Approach,
Addison-Wesley, Reading, MA, 1989.

[9] Preparata, F. P., and Shamos, M. I., Computational geometry: an
introduction, Springer-Verlag, New York, NY, 1985.

[10] Randell, B., “System structure for software fault tolerance,”
IEEE Trans. on Software Engineering, vol. 1, pp. 220-232, June, 1975.

[11] Sullivan, G.F., and Masson, G.M., “Using certification trails to
achieve software fault tolerance,” Digest of the 1990 Fault Tolerant
Computing Symposium, pp. 423-431, IEEE Computer Society Press, 1990.

[12] Sullivan, G.F., and Masson, G.M., “Certification trails for data
structures,” Digest of the 1991 Fault Tolerant Computing Symposium,
pp. 240-247, IEEE Computer Society Press, 1991.

[13] Sullivan, G.F., Wilson, D.S., Masson, G.M., Itoh, M., Smith,
W.S., Kay, J.S., “Experimental evaluation of the certification trail
method,” Technical Report, Computer Science Department, The Johns
Hopkins University.









United States Patent [19]
Masson et al.

US005243607A

[11] Patent Number: 5,243,607
[45] Date of Patent: Sep. 7, 1993

[54] METHOD AND APPARATUS FOR FAULT TOLERANCE

[75] Inventors: Gerald M. Masson; Gregory F. Sullivan, both of Baltimore, Md.

[73] Assignee: The Johns Hopkins University, Baltimore, Md.

[21] Appl. No.: 543,451

[22] Filed: Jun. 25, 1990

[51] Int. Cl.5 H04L1/M

[52] U.S. Cl. 371/69.1; 371/63.3; 371/63.1; 371/19; 395/575

[58] Field of Search 371/69.1, 63.3, 63.1, 371/19, 15.1, 16.1, 67.1; 364/200 MS File; 395/575

[56] References Cited

U.S. PATENT DOCUMENTS

4,696,003 9/1987 Kerr 371/69.1 X
4,756,005 7/1988 Shedd 371/69.1 X
5,005,174 4/1991 Bruckert et al. 371/63.3

OTHER PUBLICATIONS

H. Geng, “Circuit for the Complete Check of a Data-Processing
System”, IBM TDB, vol. 16, No. 4, Sep. 1974, pp. 1144-1145.

K. Knowlton, “A Combination Hardware-Software Debugging
System,” IEEE Transactions on Computers, Jan. 1968, pp. 3U36.

Primary Examiner: Robert W. Beausoliel, Jr.
Assistant Examiner: Ly V. Hua
Attorney, Agent, or Firm: Ansel M. Schwartz

[57] ABSTRACT 

A method and apparatus for achieving fault tolerance in
a computer system having at least a first central
processing unit and a second central processing unit. The
method comprises the steps of first executing a first
algorithm in the first central processing unit on input
which produces a first output as well as a certification
trail. Next, executing a second algorithm in the second
central processing unit on the input and on at least a
portion of the certification trail which produces a
second output. The second algorithm has a faster execution
time than the first algorithm for a given input. Then,
comparing the first and second outputs such that an
error result is produced if the first and second outputs
are not the same. The step of executing a first algorithm
and the step of executing a second algorithm preferably
takes place over essentially the same time period.

18 Claims, 6 Drawing Sheets 




U.S. Patent    Sep. 7, 1993    Sheet 1 of 6    5,243,607

FIG. 1 (drawing not reproduced)


Algorithm MINSPAN(G, weight)

Input: Connected graph G = (V, E) where V = {1, ..., n} with edge weights.
Output: Spanning tree of G which has minimum weight.

 1  CHOOSE root ∈ V
 2  FOR ALL u ∈ V, key(u) := ∞ END FOR
 3  h := ∅; v := root
 4  WHILE v ≠ "empty" DO
 5    key(v) := −∞
 6    FOR each [v,w] ∈ E DO
 7      IF weight([v,w]) < key(w) THEN
 8        key(w) := weight([v,w]); prefer(w) := [v,w]
 9        IF member(w,h) THEN changekey(w,key(w),h)
10        ELSE insert(w,key(w),h) END IF
11      END IF
12    END FOR
13    (v,k) := deletemin(h)
14  END WHILE
15  FOR ALL u ∈ V − {root}, OUTPUT(prefer(u)) END FOR
END MINSPAN

FIG. 3



FIG. 2(e), FIG. 2(f) (drawings not reproduced)

FIG. 4(a), FIG. 4(b) (drawings not reproduced)


Algorithm HUFFMAN(FREQ)

Input: Sequence of positive integers FREQ = f[1], f[2], ..., f[n]
Output: Pointer to a Huffman tree for the input frequencies

 1  FOR i := 1 to n DO
 2    insert(i, f[i], h)
 3    ptr[i] := allocate()
 4    info[ptr[i]] := (i, f[i])
 5  END FOR
 6  FOR j := n+1 to 2n−1 DO
 7    (item1, key1) := deletemin(h)
 8    (item2, key2) := deletemin(h)
 9    ptr[j] := allocate()
10    info[ptr[j]] := (j, key1 + key2)
11    left[ptr[j]] := ptr[item1]
12    right[ptr[j]] := ptr[item2]
13    insert(j, key1 + key2, h)
14  END FOR
15  OUTPUT(ptr[2n−1])
END HUFFMAN

FIG. 5






Algorithm CONVEXHULL(S)

 1  Let p1 be the point with the largest x coordinate (and ...)
 2  For each point p (except p1) calculate the slope of the line through p1 and p
 3  Sort the points (except p1) from the smallest slope to the largest. Call them p2, ..., pn
 4  q1 := p1; q2 := p2; q3 := p3; m := 3
 5  FOR k = 4 to n DO
 6    WHILE the angle formed by q(m−1), qm, pk is ≥ 180 degrees DO m := m−1 END WHILE
 7    m := m + 1
 8    qm := pk
 9  END FOR
10  FOR i = 1 to m DO, OUTPUT(qi) END FOR
END CONVEXHULL

FIG. 7



FIG. 8(a), FIG. 8(b), FIG. 8(c), FIG. 9 (drawings not reproduced)

FIG. 10 (block diagram; recoverable labels: INPUT, FIRST CENTRAL
PROCESSING UNIT, FIRST ALGORITHM, FIRST OUTPUT, CERTIFICATION TRAIL,
SECOND CENTRAL PROCESSING UNIT, SECOND ALGORITHM, SECOND OUTPUT, COMPARE)

FIG. 11 (drawing not reproduced)
















METHOD AND APPARATUS FOR FAULT TOLERANCE

LICENSES

The United States Government has a paid-up non-exclusive
license to practice the claimed invention
herein as per NSF Grant CCR-8910569 and NASA
Grant NSG 1442.

FIELD OF THE INVENTION

The present invention relates to fault tolerance. More
specifically, the present invention relates to a first
algorithm that provides a certification trail to a second
algorithm for fault tolerance purposes.

BACKGROUND OF THE INVENTION

Traditionally, with respect to fault tolerance, the
specification of a problem is given and an algorithm to
solve it is constructed. This algorithm is executed on an
input and the output is stored. Next, the same algorithm
is executed again on the same input and the output is
compared to the earlier output. If the outputs differ then
an error is indicated, otherwise the output is accepted as
correct. This software fault tolerance method requires
additional time, so called time redundancy [Johnson, B.,
Design and analysis of fault tolerant digital systems,
Addison-Wesley, Reading, Mass., 1989; Siewiorek, D.,
and Swarz, R., The theory and practice of reliable
design, Digital Press, Bedford, Mass., 1982]; however, it
requires no additional software. It is particularly
valuable for detecting errors caused by transient fault
phenomena. If such faults cause an error during only one of
the executions then either the error will be detected or
the output will be correct.

A variation of the above method uses two separate
algorithms, one for each execution, which have been
written independently based on the problem specification.
This technique, called N-version programming
[Chen, L., and Avizienis, A., “N-version programming:
a fault tolerant approach to reliability of software
operation,” Digest of the 1978 Fault Tolerant Computing
Symposium, pp. 3-9, IEEE Computer Society Press,
1978; Avizienis, A., “The N-version approach to fault
tolerant software,” IEEE Trans. on Software Engineering,
vol. 11, pp. 1491-1501, December, 1985] (in this
case N = 2), allows for the detection of errors caused by
some faults in the software in addition to those caused
by transient hardware faults and utilizes both time and
software redundancy. Errors caused by software faults
are detected whenever the independently written programs
do not generate coincident errors.

SUMMARY OF THE INVENTION

The present invention pertains to a method for
achieving fault tolerance in a computer system having
at least a first central processing system and a second
central processing system. The method comprises the
steps of first executing a first algorithm in the first
central processing unit on input which produces a first
output as well as a certification trail. Next, executing a
second algorithm in the second central processing unit
on the input and on at least a portion of the certification
trail which produces a second output. The second
algorithm has a faster execution time than the first algorithm
for a given input. Then, comparing the first and second
outputs such that an error result is produced if the first
and second outputs are not the same. The step of executing
a first algorithm and the step of executing a second
algorithm preferably takes place over essentially the
same time period.


The present invention also pertains to a method for
achieving fault tolerance in a central processing unit.
The method comprises the steps of executing a first
algorithm in the central processing unit on input which
produces the first output as well as a certification trail.
Then, there is the step of executing a second algorithm
in the central processing unit on the input and on at least
a portion of the certification trail which produces a
second output. The second algorithm has a faster execution
time than the first algorithm for a given input.
Then, there is the step of comparing the first and second
outputs such that an error result is produced if the first
and second outputs are not the same.

The present invention also pertains to a computer
system. The computer system comprises a first computer.
The first computer has a first memory. The first
computer also has a first central processing unit in
communication with the memory. The first computer
additionally has a first input port in communication with the
memory in the first central processing unit. There is a
first algorithm disposed in the first memory which produces
a first output as well as a certification trail based
on input received by the input port when it is executed
by the first central processor. The computer system is
additionally comprised of a second computer. The second
computer is comprised of a second memory. The
second computer is also comprised of a second central
processing unit in communication with the memory and
the first central processing unit. The second computer
additionally is comprised of a second input port in
communication with the memory in the second central
processing unit. There is a second algorithm disposed in the
second memory which produces a second output based
on the input and on at least a portion of the certification
trail when the second algorithm is executed by the second
central processing unit. The second algorithm has a
faster execution time than the first algorithm for a given
input. The computer system is also comprised of a
mechanism for comparing the first and second outputs
such that an error result is produced if the first and
second outputs are not the same.

Moreover, the present invention also pertains to a
computer. The computer is comprised of a memory.
Additionally, the computer is comprised of a central
processing unit in communication with the memory.
The computer is additionally comprised of a first input
port in communication with the memory and the central
processing unit. There is a first algorithm disposed in
the memory which produces a first output as well as a
certification trail based on input received by the input
port when the input is executed by the first central
processor. There is a second algorithm also disposed in
the memory which produces a second output based on
the input and on at least a portion of the certification
trail when the second algorithm is executed by the central
processing unit. The second algorithm has a faster
execution time than the first algorithm for a given input.
Moreover, the computer is comprised of a mechanism
for comparing the first and second outputs such that an
error result is produced if the first and second outputs
are not the same.





BRIEF DESCRIPTION OF THE DRAWINGS 

In the accompanying drawings, the preferred embodiments
of the invention and preferred methods of
practicing the invention are illustrated in which:

FIG. 1 is a block diagram of the present invention.

FIGS. 2A through 2F show an example of a
minimum spanning tree algorithm.

FIG. 3 shows the source code for a MINSPAN algorithm.

FIGS. 4A and 4B show an example of a data structure
used in the second execution of a MINSPAN algorithm.

FIG. 5 shows the source code for a Huffman algorithm.

FIG. 6 shows an example of a Huffman tree.

FIG. 7 shows the source code for Graham’s scan algorithm.

FIGS. 8A through 8C show a convex hull example.

FIG. 9 is a block diagram of an apparatus of the
present invention.

FIG. 10 is a block diagram of another embodiment of
the present invention.

FIG. 11 is a block diagram of another embodiment of
the present invention.

DESCRIPTION OF THE PREFERRED 
EMBODIMENT 

The central idea of the present invention, essentially a
fault tolerance mechanism, as illustrated in FIG. 1, is to
modify a first algorithm so that it leaves behind a trail of
data which is called a certification trail. This data is
chosen so that it can allow a second algorithm to execute
more quickly and/or have a simpler structure than
the first algorithm. The outputs of the two executions
are compared and are considered correct only if they
agree. Note, however, care must be taken in defining
this method or else its error detection capability might
be reduced by the introduction of data dependency
between the two algorithm executions. For example, suppose
the first algorithm execution contains an error
which causes an incorrect output and an incorrect trail
of data to be generated. Further suppose that no error
occurs during the execution of the second algorithm. It
still appears possible that the execution of the second
algorithm might use the incorrect trail to generate an
incorrect output which matches the incorrect output
given by the execution of the first algorithm. Intuitively,
the second execution would be “fooled” by the
data left behind by the first execution. The definitions
given below exclude this possibility. They demand that
the second execution either generates a correct answer
or signals the fact that an error has been detected in the
data trail. Finally, it should be noted that in FIG. 1 both
executions can signal an error. These errors would include
run-time errors such as divide-by-zero or non-terminating
computation. In addition the second execution
can signal error due to an incorrect certification
trail. The fault tolerance means can be used in hardware
or software systems and manifested as firmware or software
in a central processing unit.

A formal definition of a certification trail is the following.

Definition 2.1. A problem P is formalized as a relation
(that is, a set of ordered pairs). Let D be the domain
(that is, the set of inputs) of the relation P and let S be
the range (that is, the set of solutions) for the problem.
It can be said an algorithm A solves a problem P if for
all d ∈ D when d is input to A then an s ∈ S is output such
that (d,s) ∈ P.

Definition 2.2. Let P: D → S be a problem. Let T be
the set of certification trails. A solution to this problem
using a certification trail consists of two functions F1
and F2 with the following domains and ranges: F1: D →
S × T and F2: D × T → S ∪ {error}. The functions must
satisfy the following two properties:

(1) for all d ∈ D there exists s ∈ S and there exists t ∈
T such that F1(d) = (s,t) and F2(d,t) = s and (d,s) ∈ P

(2) for all d ∈ D and for all t ∈ T either (F2(d,t) = s and
(d,s) ∈ P) or F2(d,t) = error.

The definitions above assure that the error detection
capability of the certification trail approach is comparable
to that obtained with the simple time redundancy
approach discussed earlier. That is, if transient hardware
faults occur during only one of the executions
then either an error will be detected or the output will
be correct. It should be further noted, however, the
examples to be considered will indicate that this new
approach can also save overall execution time.

The certification trail approach also allows for the
detection of faults in software. As in N-version programming,
separate teams can write the algorithms for the two
executions; the specification
now must include precise information describing the
generation and use of the certification trail. Because of
the additional data available to the second execution,
the specifications of the two phases can be very different;
similarly, the two algorithms used to implement the
phases can be very different. This will be illustrated in
the convex hull example to be considered later. Alternatively,
the two algorithms can be very similar, differing
only in data structure manipulations. This will be illustrated
in the minimum spanning tree and Huffman tree
examples to be considered later. When significantly
different algorithms are used it is sometimes possible to
save programming effort by sharing program code.
While this reduces the ability to detect errors in the
software it does not change the ability to detect transient
hardware errors as discussed earlier.

With respect to the above, it has been assumed that
our method is implemented with software; however, it
is clearly possible to implement the certification trail
technique by using dedicated hardware. It is also possible
to generalize the basic two-level
certification trail approach as illustrated in FIG. 1 to
higher levels.
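
Definition 2.2 can be made concrete with a small sketch. The example below is an illustration only and is not taken from the text: it uses sorting as the problem P, and the names f1_sort, f2_sort, certified_sort and ERROR are hypothetical. The first function plays the role of F1 and emits, as its trail, the permutation that sorts the input; the second plays the role of F2 and rebuilds the output from the input and the trail in linear time, returning the error token unless the trail really is a permutation that sorts the input. Under these assumptions property (2) holds because any trail that passes both checks necessarily yields the sorted list.

```python
from typing import List, Tuple, Union

ERROR = "error"  # hypothetical token standing for the "error" output of F2


def f1_sort(d: List[int]) -> Tuple[List[int], List[int]]:
    """Role of F1: sort d and leave behind a trail (the sorting permutation)."""
    trail = sorted(range(len(d)), key=lambda i: d[i])
    return [d[i] for i in trail], trail


def f2_sort(d: List[int], trail: List[int]) -> Union[List[int], str]:
    """Role of F2: use the trail to produce the output in linear time.

    Returns ERROR unless the trail is a permutation of the indices of d
    that lists the elements of d in nondecreasing order.
    """
    n = len(d)
    if len(trail) != n:
        return ERROR
    seen = [False] * n
    out: List[int] = []
    for i in trail:
        # Each index must be valid and used exactly once (permutation check).
        if type(i) is not int or not 0 <= i < n or seen[i]:
            return ERROR
        seen[i] = True
        # The elements must appear in sorted order (order check).
        if out and out[-1] > d[i]:
            return ERROR
        out.append(d[i])
    return out


def certified_sort(d: List[int]) -> Union[List[int], str]:
    """Run both phases and compare the outputs, as in FIG. 1."""
    s1, trail = f1_sort(d)
    s2 = f2_sort(d, trail)
    return s1 if s2 != ERROR and s1 == s2 else ERROR
```

A corrupted trail such as [0, 1, 2] for the input [3, 1, 2] fails the order check, so the second phase reports the error token rather than agreeing with a wrong answer.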

Examples of the Certification Trail Technique 

In this section, there is illustrated the use of certification
trails by means of applications to three well-known
and significant problems in computer science: the minimum
spanning tree problem, the Huffman tree problem,
and the convex hull problem. It should be stressed here
that the certification trail approach is not limited to
these problems. Rather, these algorithms have been
selected only to give illustrations of this technique.

Minimum Spanning Tree Example 

The minimum spanning tree problem has been examined
extensively in the literature and an historical survey
is given in [Graham, R.L., “An efficient algorithm
for determining the convex hull of a planar set”,
Information Processing Letters, pp. 132-133, 1, 1972]. The
certification trail approach is applied to a variant of the
Prim/Dijkstra algorithm [Prim, R.C., “Shortest connection
networks and some generalizations,” Bell Syst.
Tech. J., pp. 1389-1401, November, 1957; Dijkstra, E.
W., “A note on two problems in connexion with
graphs,” Numer. Math. 1, pp. 269-1984, Jun. 20-22] as
explicated in [Tarjan, R.E., Data Structures and Network
Algorithms, Society for Industrial and Applied
Mathematics, Philadelphia, Pa., 1983]. The discussion of
the application of the certification trail approach to the
minimum spanning tree problem begins with some preliminary
definitions.

Definition 3.1. A graph G = (V,E) consists of a vertex
set V and an edge set E. An edge is an unordered
pair of distinct vertices which is notated as, for example,
[v,w], and it is said v is adjacent to w. A path in a graph
from v1 to vk is a sequence of vertices v1, v2, ..., vk such
that [vi, vi+1] is an edge for i ∈ {1, ..., k−1}. A path
is a cycle if k > 1 and v1 = vk. An acyclic graph is a
graph which contains no cycles. A connected graph is a
graph such that for all pairs of vertices v,w there is a
path from v to w. A tree is an acyclic and connected
graph.

Definition 3.2. Let G = (V,E) be a graph and let w be
a positive rational valued function defined on E. A
subtree of G is a tree, T = (V',E'), with V' ⊆ V and E' ⊆
E. It is said T spans V' and V' is spanned by T. If V' =
V then we say T is a spanning tree of G. The weight of
this tree is Σ_{e ∈ E'} w(e). A minimum spanning tree is a
spanning tree of minimum weight.

Data Structures and Supported Operations

Before discussion of the minimum spanning tree algorithm,
there must be described the properties of the
principal data structure that are required. Since many
different data structures can be used to implement the
algorithm, initially there is described abstractly the data
that can be stored by the data structure and the operations
that can be used to manipulate this data. The data
consists of a set of ordered pairs. The first element in
these ordered pairs is referred to as the item number and
the second element is called the key value. Ordered
pairs may be added and removed from the set; however,
at all times, the item numbers of distinct ordered pairs
must be distinct. It is possible, though, for multiple
ordered pairs to have the same key value. In this paper
the item numbers are integers between 1 and n, inclusive.
Our default convention is that i is an item number,
k is a key value and h is a set of ordered pairs. A total
ordering on the pairs of a set can be defined lexicographically
as follows: (i,k) < (i',k') iff k < k' or (k =
k' and i < i'). The data structure should support a subset
of the following operations.

member (i,h) returns a boolean value of true if h contains
an ordered pair with item number i, otherwise
returns false.

insert (i,k,h) adds the ordered pair (i,k) to the set h.

delete (i,h) deletes the unique ordered pair with item
number i from h.

changekey (i,k,h) is executed only when there is an
ordered pair with item number i in h. This pair is
replaced by (i,k).

deletemin (h) returns the ordered pair which is smallest
according to the total order defined above and deletes
this pair. If h is the empty set then the token
"empty" is returned.

predecessor (i,h) returns the item number of the ordered
pair which immediately precedes the pair with item
number i in the total order. If there is no predecessor
then the token "smallest" is returned.

Many different types and combinations of data structures
can be used to support these operations efficiently.
In our case, there are used two different data structure
methods to support these operations. One method will
be used in the first execution of the algorithm and another,
faster and simpler, method will be used in the
second execution. The second method relies on a trail of
data which is output by the first execution.

MINSPAN ALGORITHM

Before discussing precise implementation details for
these methods the overall algorithm used in both executions
is presented. Pidgin code for this algorithm appears
below. In addition, FIG. 2 illustrates the execution
of the algorithm on a sample graph and the table
below records the data structure operations the algorithm
must perform when run on the sample graph. The
first column of the table gives the operations, except
member, with the parameter h dropped to reduce clutter.
The second column gives the evolving contents of h.
The third column records the ordered pair deleted by
the deletemin operation. The fourth column records the
certification trail corresponding to these operations and
is further discussed below.

The algorithm uses a “greedy” method to “grow” a
minimum spanning tree. The algorithm starts by choosing
an arbitrary vertex from which to grow the tree.
During each iteration of the algorithm a new edge is
added to the tree being constructed. Thus, the set of
vertices spanned by the tree increases by exactly one
vertex for each iteration. The edge which is added to
the tree is the one with the smallest weight. FIG. 2
shows this process in action. FIG. 2(a) shows the input
graph, FIGS. 2(b) through 2(e) show several stages of
the tree growth and FIG. 2(f) shows the final output of
the minimum spanning tree. The solid edges in FIGS.
2(b) through 2(e) represent the current tree and the
dotted edges represent candidates for addition to the
tree.

To efficiently find the edge to add to the current tree
the algorithm uses the data structure operations described
above. As soon as a vertex, say v, is adjacent to
some vertex which is currently spanned it is inserted in
the set h. The key value for v is the weight of the minimum
edge between v and some vertex spanned by the
current tree. The array element prefer (v) is used to
keep track of this minimum weight edge. As the tree
grows, information is updated by operations such as
insert (i,k,h) and changekey (i,k,h).

TABLE 1

Data structure operations and certification trail for MINSPAN

Operation          Set of Ordered Pairs          Delete     Trail
insert(2,200)      (2,200)                                  smallest
insert(6,500)      (2,200),(6,500)                          2
deletemin          (6,500)                       (2,200)
insert(3,800)      (6,500),(3,800)                          6
changekey(6,450)   (6,450),(3,800)                          smallest
insert(7,505)      (6,450),(7,505),(3,800)                  6
deletemin          (7,505),(3,800)               (6,450)
insert(5,250)      (5,250),(7,505),(3,800)                  smallest
changekey(7,495)   (5,250),(7,495),(3,800)                  5
deletemin          (7,495),(3,800)               (5,250)
changekey(3,350)   (3,350),(7,495)                          smallest
insert(4,700)      (3,350),(7,495),(4,700)                  7
deletemin          (7,495),(4,700)               (3,350)
changekey(4,650)   (7,495),(4,650)                          7
deletemin          (4,650)                       (7,495)
deletemin                                        (4,650)
deletemin                                        empty


The deletemin (h) operation is used to select the next
vertex to add to the span of the current tree. Note, the
algorithm does not explicitly keep a set of edges representing
the current tree. Implicitly, however, if (v,k) is
returned by deletemin then prefer (v) is added to the
current tree.

In the first execution of the MINSPAN algorithm,
the MINSPAN code is used and the principal data
structure is implemented with a balanced tree such as an
AVL tree [Adel'son-Vel'skii, G.M., and Landis, E.M.,
“An algorithm for the organization of information”,
Soviet Math. Dokl., pp. 1259-1262, 3, 1962], a red-black
tree [Guibas, L.J., and Sedgewick, R., “A dichromatic
framework for balanced trees”, Proceedings of the
Nineteenth Annual Symposium on Foundations of
Computing, pp. 8-21, IEEE Computer Society Press,
1978] or a b-tree [Bayer, R., and McCreight, E., “Organization
of large ordered indexes”, Acta Inform., pp.
173-189, 1, 1972]. In addition, an array of pointers indexed
from 1 to n is used. The balanced search tree
stores the ordered pairs in h and is based on the total
order described earlier. The array of pointers is initially
all nil. For each item i, the ith pointer of the array is
used to point to the location of the ordered pair with
item number i in the balanced search tree. If there is no
such ordered pair in the tree then the ith pointer is nil.
This array allows rapid execution of operations such as
member (i,h) and delete (i,h).

The certification trail is generated during the first
execution as follows: When CHOOSE root ∈ V is executed
in the first step, the vertex which is chosen is
output. Also, each time insert (i,k,h) or changekey
(i,k,h) is executed, predecessor (i,h) is executed afterwards,
and the answer returned is output. This is illustrated
in the column labeled “Trail” in the table above.
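
The trail generation just described can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation from the text: a sorted Python list maintained with bisect stands in for the balanced search tree (so operations are not O(log n) here), a dictionary stands in for the array of pointers, and the names SortedPairs and minspan_first as well as the four-vertex test graph are hypothetical. Pairs are stored as (key, item) tuples so that Python's tuple comparison coincides with the total order defined earlier.

```python
import bisect
import math

SMALLEST = "smallest"  # token returned when a pair has no predecessor


class SortedPairs:
    """Stand-in for the balanced search tree holding the set h."""

    def __init__(self):
        self.pairs = []   # sorted list of (key, item)
        self.key_of = {}  # item -> key; plays the role of the pointer array

    def member(self, i):
        return i in self.key_of

    def insert(self, i, k):
        bisect.insort(self.pairs, (k, i))
        self.key_of[i] = k

    def delete(self, i):
        self.pairs.remove((self.key_of.pop(i), i))

    def changekey(self, i, k):
        self.delete(i)
        self.insert(i, k)

    def deletemin(self):
        if not self.pairs:
            return None  # the token "empty"
        k, i = self.pairs.pop(0)
        del self.key_of[i]
        return i, k

    def predecessor(self, i):
        pos = bisect.bisect_left(self.pairs, (self.key_of[i], i))
        return self.pairs[pos - 1][1] if pos > 0 else SMALLEST


def minspan_first(n, edges, root=1):
    """First execution: return (tree edges, certification trail).

    edges maps an unordered pair (v, w) of vertices in 1..n to its weight.
    """
    adj = {v: [] for v in range(1, n + 1)}
    for (a, b), wt in edges.items():
        adj[a].append((b, wt))
        adj[b].append((a, wt))
    key = {u: math.inf for u in adj}
    prefer = {}
    h = SortedPairs()
    trail = [root]  # the chosen root is the first trail entry
    v = root
    while v is not None:
        key[v] = -math.inf
        for w, wt in adj[v]:
            if wt < key[w]:
                key[w] = wt
                prefer[w] = (v, w)
                if h.member(w):
                    h.changekey(w, wt)
                else:
                    h.insert(w, wt)
                # After every insert or changekey, output the predecessor.
                trail.append(h.predecessor(w))
        nxt = h.deletemin()
        v = nxt[0] if nxt else None
    return [prefer[u] for u in sorted(prefer)], trail
```

The trail therefore contains one entry for the root followed by one entry per insert or changekey, in the order in which those operations are executed.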

The second execution of the MINSPAN algorithm
also uses the MINSPAN code; however, the CHOOSE
construct and the data structure operations are implemented
differently than in the first execution. The
CHOOSE is performed by simply reading the first element
of the certification trail. This guarantees the same
choice of a starting vertex is made in both executions.
FIG. 4 depicts the principal data structure used which is
called an indexed linked list. The array is indexed from
1 to n and contains pointers to a singly linked list which
represents the current contents of h from smallest to
largest. The ith element of the array points to the node
containing the ordered pair with the item number i if it
is present in h; otherwise, the pointer is nil. The 0th
element of the array points to the node containing (0,
-INF). Initially, the array contains nil pointers except
the 0th element. In order to implement the data structure
operations, the following is provided.

To perform insert (i,k,h), it is necessary to read the
next value in the certification trail. This value, say j, is
the item number of the ordered pair which is the predecessor
of (i,k) in the current contents of h. A new linked
list node is allocated and the trail information is used to
insert the node into the data structure. Specifically, the
jth array pointer is traversed to a node in the linked list,
say Y. (If j = "smallest" then the 0th array pointer is
traversed.) The new node is inserted in the list just after
node Y and before the next node in the linked list (if
there is one). The data field in the new node is set to (i,k)
and the ith pointer of the array is set to point to the new
node. FIG. 4 shows the insertion of (7,505) into the data
structure given that the certification trail value is 6.
FIG. 4(a) is before the insertion and FIG. 4(b) is after
the insertion.

When the insert operation is performed, some checks
must be conducted. First, the ith array pointer must be
nil before the operation is performed. Second, the
sorted order of the pairs stored in the linked list must be
preserved after the operation. That is, if (i',k') is stored
in the node before (i,k) in the linked list and (i",k") is
stored after (i,k), then (i',k') < (i,k) < (i",k") must hold
in the total order. If either of these checks fails then
execution halts and "error" is output.

To perform delete (i,h) the ith array pointer is traversed
and the node found is deleted from the linked
list. Next, the ith array pointer is set to nil. FIG. 4 shows
the deletion of item number 7 if one considers FIG. 4(b)
as depicting the data structure before the operation and
FIG. 4(a) depicting it afterwards. When the delete operation
is performed one check is made. If the ith array
pointer is nil before the operation then the execution
halts and "error" is output.

To perform changekey (i,k,h) it suffices to perform
delete (i,h) followed by insert (i,k,h). Note, this means
the next item in the certification trail is read. Also, the
checks associated with both these two operations are
performed and the execution halts with “error” output
if any check fails.

To perform deletemin (h) the 0th array pointer is
traversed to the head of the list and the next node in the
list is accessed. If there is no such node then “empty” is
returned and the operation is complete. Otherwise,
suppose the node is Y and suppose it contains the ordered
pair (i,k), then the node Y is deleted from the list,
the ith array pointer is set to nil, and (i,k) is returned.

Lastly, to perform member (i,h) the ith array pointer
is examined. If it is nil then false is returned, otherwise,
true is returned. The predecessor (i,h) operation is not
used in the second execution.
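
The operations of the second execution can be sketched as follows. This is an illustration under stated assumptions rather than the implementation from the text: the list is kept doubly linked (prev and next arrays) so that delete runs in constant time without a search, whereas the text describes a singly linked list; and the names IndexedLinkedList, TrailError, minspan_second and ERROR are hypothetical. Each check described above raises TrailError, which the driver turns into the "error" output.

```python
import math

SMALLEST = "smallest"
ERROR = "error"


class TrailError(Exception):
    """A check failed; the second execution must output "error"."""


class IndexedLinkedList:
    def __init__(self, n, trail_values):
        self.n = n
        self.trail = iter(trail_values)
        self.key = [None] * (n + 1)   # key[i] is None <=> ith pointer is nil
        self.prev = [None] * (n + 1)
        self.next = [None] * (n + 1)
        self.key[0] = -math.inf       # 0th element: the node (0, -INF)

    def member(self, i):
        return self.key[i] is not None

    def insert(self, i, k):
        j = next(self.trail, None)    # predecessor claimed by the trail
        if j == SMALLEST:
            j = 0
        if type(j) is not int or not 0 <= j <= self.n or self.key[j] is None:
            raise TrailError("trail names no node in the list")
        if self.key[i] is not None:   # first check: ith pointer must be nil
            raise TrailError("item already present")
        nxt = self.next[j]
        # Second check: sorted order must hold on both sides of the new node.
        if not (self.key[j], j) < (k, i):
            raise TrailError("sorted order violated")
        if nxt is not None and not (k, i) < (self.key[nxt], nxt):
            raise TrailError("sorted order violated")
        self.key[i], self.prev[i], self.next[i] = k, j, nxt
        self.next[j] = i
        if nxt is not None:
            self.prev[nxt] = i

    def delete(self, i):
        if self.key[i] is None:
            raise TrailError("item not present")
        p, nxt = self.prev[i], self.next[i]
        self.next[p] = nxt
        if nxt is not None:
            self.prev[nxt] = p
        self.key[i] = None

    def changekey(self, i, k):
        self.delete(i)
        self.insert(i, k)

    def deletemin(self):
        i = self.next[0]
        if i is None:
            return None               # the token "empty"
        k = self.key[i]
        self.delete(i)
        return i, k


def minspan_second(n, edges, trail):
    """Second execution: return the tree edges, or ERROR if a check fails."""
    try:
        root = trail[0] if trail else None
        if type(root) is not int or not 1 <= root <= n:
            raise TrailError("bad root")
        adj = {v: [] for v in range(1, n + 1)}
        for (a, b), wt in edges.items():
            adj[a].append((b, wt))
            adj[b].append((a, wt))
        key = {u: math.inf for u in adj}
        prefer = {}
        h = IndexedLinkedList(n, trail[1:])
        v = root
        while v is not None:
            key[v] = -math.inf
            for w, wt in adj[v]:
                if wt < key[w]:
                    key[w], prefer[w] = wt, (v, w)
                    if h.member(w):
                        h.changekey(w, wt)
                    else:
                        h.insert(w, wt)
            nxt = h.deletemin()
            v = nxt[0] if nxt else None
        return [prefer[u] for u in sorted(prefer)]
    except TrailError:
        return ERROR
```

Because every insertion is checked against both neighbours, the list stays sorted no matter what the trail contains, so deletemin always returns the true minimum or the run ends with the error token.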

This completes the description of the second execution.
To show that there is described a correct implementation
of the certification trail method requires a
proof. The proof has several parts of varying difficulty.
First, one must show that if the first execution is fault-free
then it outputs a minimum spanning tree. Second,
one must show that if the first and second executions are
fault-free then they both output the same minimum
spanning tree. Both these parts of the proof are not
difficult to show.

The third, more subtle part of the proof deals with the situation
in which only the second execution is fault-free. This means an
incorrect certification trail may be generated in the first
execution. In this case, it must be shown that the second
execution outputs either the correct minimum spanning tree or
"error". The checks that were described ensure this property by
detecting any errors that would prevent the execution from
generating the correct output.

In the first execution each data structure operation can be
performed in O(log(n)) time where |V| = n. There are at most O(m)
such operations and O(m) additional time overhead where |E| = m.
Thus, the first execution can be performed in O(m log(n)). It is
noted that this algorithm does not achieve the fastest known
asymptotic time complexity, which appears in Gabow, H.N., Galil,
Z., Spencer, T., and Tarjan, R.E., "Efficient algorithms for
finding minimum spanning trees in undirected and directed
graphs," Combinatorica 6, pp. 109-122, 2, 1986. However, the
algorithm presented here has a significantly smaller constant of
proportionality which makes it competitive for reasonably sized
graphs. In addition, it provides us with a relatively simple and
illustrative example of the use of a certification trail.

In the second execution each data structure operation can be
performed in O(1). There are still at most O(m) such operations
and O(m) additional time overhead. Hence, the second execution
can be performed in O(m) time. In other words, because of the
availability of the certification trail, the second execution is
performed in linear time. There are no known O(m) time algorithms
for the minimum spanning tree problem. Komlos [26] was able to
show that O(m) comparisons suffice to find the minimum spanning
tree. However, there is no known O(m) time algorithm to actually
find and perform these comparisons. Even the related
"verification" problem has no known linear time solution. In the
verification problem the input consists of an edge weighted graph
and a subtree. The output is "yes" if the subtree is the minimum
spanning tree and "no" otherwise. The best known algorithm for
this problem was created by Tarjan [Tarjan, R.E., "Applications
of path compression on balanced trees", J. ACM, pp. 690-715,
October, 1979] and has the nonlinear time complexity of
O(m α(m,n)), where α(m,n) is a functional inverse of Ackermann's
function. The fact that the data in a certification trail enables
a minimum spanning tree to be found in linear time is, we
believe, intriguing, significant, and indicative of the great
promise of the certification trail technique.

Huffman Tree Example 

Huffman trees represent another classic algorithmic problem, one
of the original solutions being attributed to Huffman [Huffman,
D., "A method for the construction of minimum redundancy codes",
Proc. IRE, pp. 1098-1101, 40, 1952]. This solution has been used
extensively to perform data compression through the design and
use of so-called Huffman codes. These codes are prefix codes
which are based on the Huffman tree and which yield excellent
data compression ratios. The tree structure and the code design
are based on the frequencies of individual characters in the data
to be compressed. See Huffman, D., "A method for the construction
of minimum redundancy codes", Proc. IRE, pp. 1098-1101, 40, 1952,
for information about the coding application.

Definition 3.3. The Huffman tree problem is the following: Given
a sequence of frequencies (positive integers) f[1], f[2], ...,
f[n], construct a tree with n leaves and with one frequency value
assigned to each leaf so that the weighted path length is
minimized. Specifically, the tree should minimize the following
sum: $\sum_{l_i \in \mathrm{LEAF}} \mathrm{len}(i)\, f[i]$, where
LEAF is the set of leaves, len(i) is the length of the path from
the root of the tree to the leaf l_i, and f[i] is the frequency
assigned to the leaf l_i.

An example of a Huffman tree is given in FIG. 6. The input
frequencies are: f[1] = 35, f[2] = 20, f[3] = 44, f[4] = 77,
f[5] = 23, f[6] = 38, and f[7] = 88. The frequencies appear
inside the leaf nodes as the second elements of the ordered pairs
in the figure.

HUFFMAN ALGORITHM 

The algorithm to construct the Huffman tree uses a data structure
which is able to implement the insert and the deletemin
operations which are defined above in the minimum spanning tree
example. This type of data structure is often called a priority
queue. The algorithm also uses the command allocate to construct
the tree. This command allocates a new node and returns a pointer
to it. Each node is able to store an item number and a key value
in the field called info. The item numbers are in the set
{1, ..., 2n - 1} and the key values are sums of frequency values.
The nodes also contain fields for left and right pointers since
the tree being constructed is binary.

The Huffman tree is built from the bottom up and the overall
structure of the algorithm is based on the greedy "merging" of
subtrees. An array of pointers called ptr is used to point to the
subtrees as they are constructed. Initially, n single vertex
subtrees are constructed, one for each input frequency. The
algorithm then repeatedly merges the two subtrees with the
smallest associated frequency values. To perform a merge a new
subtree is created by first allocating a new root node and next
setting the left and right pointers to the two subtrees being
merged. The frequency associated with the new subtree is the sum
of the frequencies of the two subtrees being merged. In FIG. 6
the frequency associated with each subtree is shown as the second
value in the root vertex of the subtree. Details of the algorithm
are given below. Note that the priority queue data structure
allows the algorithm to quickly determine which subtrees should
be merged by enabling the two smallest frequency values to be
found efficiently during each iteration.
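The greedy merging described above can be sketched in Python with the standard `heapq` module standing in for the priority queue. This is an illustration, not the HUFFMAN code of the report: the function name is hypothetical, ties are assumed to be broken by item number, and a final deletemin that removes the root is assumed because the text later observes that all elements are ultimately deleted from h. Each deletemin result is appended to a list that plays the role of the certification trail.

```python
import heapq


def huffman_first_execution(freqs):
    """Return (root item, tree, trail) for frequencies freqs[0..n-1].

    Item i (1-based) is the leaf for freqs[i-1]; items n+1 .. 2n-1 are
    the roots created by merges. tree maps a merged item to its
    (left, right) children. Assumes at least one frequency.
    """
    n = len(freqs)
    # Heap entries are (key, item) so the smallest key comes out first.
    heap = [(f, i) for i, f in enumerate(freqs, start=1)]
    heapq.heapify(heap)
    tree, trail = {}, []
    for new_item in range(n + 1, 2 * n):
        k1, i1 = heapq.heappop(heap)      # first deletemin
        k2, i2 = heapq.heappop(heap)      # second deletemin
        trail += [(i1, k1), (i2, k2)]     # both answers go to the trail
        tree[new_item] = (i1, i2)         # merge the two subtrees
        heapq.heappush(heap, (k1 + k2, new_item))
    key, root = heapq.heappop(heap)       # assumed final deletemin
    trail.append((root, key))
    return root, tree, trail
```

Run on the frequencies of FIG. 6, the first two trail entries are (2,20) and (5,23) and the first merged item is (8,43).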

Table 2 below illustrates the data structure operations performed
when the Huffman tree in FIG. 6 is constructed. For conciseness
the initial n insert operations have been omitted. The first
column gives the set of ordered pairs in h. The second column
gives the results of the two deletemin operations during each
iteration. Note that this column is labeled "Trail" because it is
also output as the certification trail. The third column records
the elements which are inserted by the command on line 13.


TABLE 2

Data structure operations and certification trail for HUFFMAN

Set of Ordered Pairs                               Trail               Insert

(2,20),(5,23),(1,35),(6,38),(3,44),(4,77),(7,88)
(1,35),(6,38),(8,43),(3,44),(4,77),(7,88)          (2,20),(5,23)       (8,43)
(8,43),(3,44),(9,73),(4,77),(7,88)                 (1,35),(6,38)       (9,73)
(9,73),(4,77),(10,87),(7,88)                       (8,43),(3,44)       (10,87)
(10,87),(7,88),(11,150)                            (9,73),(4,77)       (11,150)
(11,150),(12,175)                                  (10,87),(7,88)      (12,175)
(13,325)                                           (11,150),(12,175)   (13,325)
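The Trail column of Table 2 is exactly what the second execution of HUFFMAN consumes. A minimal sketch of that second execution follows, assuming the array-based checks the report gives for it: an array indexed from 1 to 2n - 1 holds k when (i,k) is in h and -1 otherwise, and each deletemin answer read from the trail must match the array and must not be smaller than the previous answer. The names are hypothetical, and the final deletemin of the root is an assumption.

```python
class TrailError(Exception):
    """A check failed; corresponds to halting and outputting "error"."""


def huffman_second_execution(freqs, trail):
    """Rebuild the tree from the trail, checking every queue operation."""
    n = len(freqs)
    slot = [-1] * (2 * n)     # slot[i] == k iff (i, k) is in h; -1 = absent
    answers = iter(trail)
    last_key = None           # key of the previous trail entry

    def insert(i, k):
        if slot[i] != -1:     # check: item i must not already be in h
            raise TrailError("insert of an item already in h")
        slot[i] = k

    def deletemin():
        nonlocal last_key
        entry = next(answers, None)
        if entry is None:
            raise TrailError("certification trail is too short")
        i, k = entry
        # check: (i, k) must really be in h; -1 only ever marks absence
        if not 1 <= i < 2 * n or slot[i] == -1 or slot[i] != k:
            raise TrailError("trail entry is not in h")
        # check: answers must be nondecreasing in key
        if last_key is not None and k < last_key:
            raise TrailError("trail entry is smaller than its predecessor")
        slot[i], last_key = -1, k
        return i, k

    for i, f in enumerate(freqs, start=1):
        insert(i, f)                      # the initial n inserts
    tree = {}
    for new_item in range(n + 1, 2 * n):
        i1, k1 = deletemin()
        i2, k2 = deletemin()
        tree[new_item] = (i1, i2)
        insert(new_item, k1 + k2)
    root, _ = deletemin()                 # assumed final deletemin
    return root, tree
```

Each operation costs constant time, so the whole pass is linear in n; a trail that names a pair not in h, or that goes backwards in key order, ends in `TrailError` rather than in a wrong tree.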


First Execution of HUFFMAN 

In this execution the code entitled HUFFMAN is used and the
priority queue data structure is implemented with a heap [Tarjan,
R.E., Data Structures and Network Algorithms, Society for
Industrial and Applied Mathematics, Philadelphia, Pa. 1983] or a
balanced search tree [Guibas, L.J., and Sedgewick, R., "A
dichromatic framework for balanced trees", Proceedings of the
Nineteenth Annual Symposium on Foundations of Computing, pp.
3-21, IEEE Computer Society Press, 1978; Adel'son-Vel'skii, G.M.,
and Landis, E.M., "An algorithm for the organization of
information", Soviet Math. Dokl., pp. 1259-1262, 3, 1962; Bayer,
R., and McCreight, E., "Organization of large ordered indexes",
Acta Inform., pp. 173-189, 1, 1972]. Actually, any correct
implementation is acceptable; however, to achieve a reasonable
time complexity for this execution the suggested implementations
are desirable. The certification trail is generated as follows:
whenever deletemin (h) is executed the item number and the key
value which are returned are both output. In the table, the
certification trail is listed in the second column.

Second Execution of HUFFMAN

This execution consists of two parts which may be logically
separated but which are performed together. In the first logical
part, the code called HUFFMAN is executed again except that the
data structure operations are treated differently. No insert
operations are performed and all deletemin operations are
performed by simply reading the ordered pairs from the
certification trail. In the second logical part, the data
structure operations are "verified". Note that "verify" here does
not mean a formal proof of correctness based on the text of an
algorithm. The problem of verification can be formulated as
follows: given a sequence of insert (i,k,h) and deletemin (h)
operations, check to see if the answers are correct. It should be
noted that while in our example there is only one h, in general
there can be multiple h's to be handled.

The description of the algorithm for the second execution can be
further simplified because only some restricted types of
operation sequences are generated by the HUFFMAN code. First, it
can be observed that all elements are ultimately deleted from h
before the algorithm terminates; second, it can be further
observed that when an element is inserted into h, its key value
is larger than the key value of the last element deleted from h.
These two important observations allow us to check a sequence
using the simplified method which is described next.

Our simplified method uses an array of integers indexed from 1 to
2n - 1. This array is used to track the contents of h. If the
ordered pair (i,k) is in h, then array element i is set to a
value of k; and if no ordered pair with item number i is in h,
then array element i is set to a value of -1. Initially, all
array elements are set to -1 and then the operation sequence is
processed. If insert (i,k) is executed then array element i is
checked to see if it contains -1. (The value of -1 is an
arbitrary selection meant to serve only as an indicator.) If
array element i does contain -1, then it is set to k. If
deletemin (h) is executed, then the answer indicated by the
certification trail, say (i,k), is examined. Array element i is
checked to see if it contains k. In addition, k is compared to
the key value of the previous element in the certification trail
sequence to see if it is greater than or equal to that previous
value. If both these checks succeed then array element i is set
to -1.

If any of the checks just described above fails, then the
execution halts and "error" is output. Otherwise the operation
sequence is considered "verified". It can be rigorously shown
that the checks described are sufficient for determining whether
the answers given in the certification trail are correct; this
proof, however, has been omitted for the sake of brevity.
Finally, it is worth noting that to combine the two logical parts
of this execution, one can perform the data structure checks in
tandem with the code execution of HUFFMAN. Each time an insert or
deletemin is encountered in the code, the appropriate set of
checks is performed.

Time Complexity Comparison of the Two Executions

Again, as in the minimum spanning tree example, the availability
of the certification trail permits the second execution for the
Huffman tree problem to be dramatically more efficient than the
first.

In the first execution of HUFFMAN, each data structure operation
can be performed in O(log(n)) time where n is the number of
frequencies in the input. There are O(n) such operations and O(n)
additional time overhead; hence, the execution can be performed
in O(n log(n)). This is the same complexity as the best known
algorithm for constructing Huffman trees.

In the second execution of HUFFMAN, each data structure operation
is performed in constant time. Further, verifying that the data
structure operations are correct takes only a constant time per
operation. Thus, it follows that the overall complexity of the
second execution is only O(n).

Convex Hull Example

The convex hull problem is fundamental in computational geometry.
The certification trail solution to the generation of a convex
hull is based on a solution due to Graham [Graham, R.L., "An
efficient algorithm for determining the convex hull of a planar
set", Information Processing Letters, pp. 132-133, 1, 1972] which
is called "Graham's Scan." (For basic definitions and concepts in
computational geometry, see the text of Preparata and Shamos
[Preparata F.P., and Shamos M.I., Computational geometry: an
introduction, Springer-Verlag, New York, N.Y., 1985].) For
simplicity in the discussion which follows, it is assumed the
points are in so-called "general position" (that is, no three
points are colinear). It is not difficult to remove this
restriction.

Definition 3.4. A convex region in R^2 is a set of points, say Q,
in R^2 such that for every pair of points in Q the line segment
connecting the points lies entirely within Q. A polygon is a
circularly ordered set of line segments such that each line
segment shares one of its endpoints with the preceding line
segment and shares the other endpoint with the succeeding line
segment in the ordering. The shared endpoints are called the
vertices of the polygon. A polygon may also be specified by an
ordering of its vertices. A convex polygon is a polygon which is
the boundary of some convex region. The convex hull of a set of
points, S, in the Euclidean plane is defined as the smallest
convex polygon enclosing all the points. This polygon is unique
and its vertices are a subset of the points in S. It is specified
by a counterclockwise sequence of its vertices.

FIG. 8(c) shows a convex hull for the points indicated by black
dots. Graham's scan algorithm given below constructs the convex
hull incrementally in a counterclockwise fashion. Sometimes it is
necessary for the algorithm to "backup" the construction by
throwing some vertices out and then continuing. The first step of
the algorithm selects an "extreme" point and calls it p_1. The
next two steps sort the remaining points in a way which is
depicted in FIG. 8(a). It is not hard to show that after these
three steps the points when taken in order, p_1, p_2, ..., p_n,
form a simple polygon; although, in general, this polygon is not
convex.

Graham's Scan Algorithm

It is possible to think of Graham's scan algorithm as removing
points from this simple polygon until it becomes convex. The main
FOR loop iteration adds vertices to the polygon under
construction and the inner WHILE loop removes vertices from the
construction. A point is removed when the angle test performed at



Step 6 reveals that it is not on the convex hull because it falls
within the triangle defined by three other points. A "snapshot"
of the algorithm given in FIG. 8(b) shows that q_5 is removed
from the hull. The angle formed by q_4, q_5, p_6 is less than 180
degrees. This means q_5 lies within the triangle formed by q_4,
p_1, p_6. (Note, q_1 = p_1.) In general, when the angle test is
performed, if the angle formed by q_{m-1}, q_m, p_k is less than
180 degrees, then q_m lies within the triangle formed by q_{m-1},
p_1, p_k. Below it will be revealed that this is the primary
information relied on in our certification trail. When the main
FOR loop is complete, the convex hull has been constructed.

First Execution of Graham's Scan 

In this execution the code CONVEXHULL is used. The certification
trail is generated by adding an output statement within the WHILE
loop. Specifically, if an angle of less than 180 degrees is found
in the WHILE loop test then the four tuple consisting of q_m,
q_{m-1}, p_1, p_k is output to the certification trail. Table 3
below shows the four tuples of points that would be output by the
algorithm when run on the example in FIG. 8. The points in Table
3 are given the same names as in FIG. 9(a). The final convex hull
points q_1, ..., q_m are also output to the certification trail.
Strictly speaking the trail output does not consist of the actual
points in R^2. Instead, it consists of indices to the original
input data. This means if the original data consists of s_1, s_2,
..., s_n then rather than output the element in R^2 corresponding
to s_i the number i is output. It is not hard to code the program
so that this is done.
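The trail generation just described can be sketched in Python. This is not the CONVEXHULL code of the report: the function names are hypothetical, the extreme point is assumed to be the lowest point (leftmost among ties), the sort is assumed to be by polar angle about it, and a negative cross product (a clockwise turn) is used as a stand-in for the report's angle test. Indices are 0-based positions in the input list, and the points are assumed to be in general position.

```python
import math


def cross(o, a, b):
    """Twice the signed area of o, a, b; positive for a counterclockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def graham_first_execution(points):
    """Return (trail, hull), both expressed as indices into points."""
    idx = range(len(points))
    # Step 1: pick the extreme point p1 (assumed: lowest, then leftmost).
    p1 = min(idx, key=lambda i: (points[i][1], points[i][0]))
    # Steps 2-3: sort the remaining points by polar angle about p1.
    rest = sorted(
        (i for i in idx if i != p1),
        key=lambda i: math.atan2(points[i][1] - points[p1][1],
                                 points[i][0] - points[p1][0]))
    hull, trail = [], []
    for pk in [p1] + rest:                 # main FOR loop: add vertices
        while len(hull) >= 2 and cross(
                points[hull[-2]], points[hull[-1]], points[pk]) < 0:
            qm = hull.pop()                # inner WHILE loop: back up
            # four tuple (q_m, q_{m-1}, p_1, p_k) goes to the trail
            trail.append((qm, hull[-1], p1, pk))
        hull.append(pk)
    return trail, hull
```

For the five points (0,0), (4,0), (4,4), (0,4), (2,1) the scan discards the interior point (2,1) and records that it lies in the triangle spanned by points 1, 0, and 2.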

TABLE 3

First part of certification trail for Graham's scan

Point not on convex hull     Three surrounding points

p5                           p3, p1, (illegible)
(illegible)                  p6, p1, (illegible)


Second Execution for the Convex Hull Problem

Let the certification trail consist of a set of four tuples,
(x_1,a_1,b_1,c_1), (x_2,a_2,b_2,c_2), ..., (x_r,a_r,b_r,c_r),
followed by the supposed convex hull, q_1, q_2, ..., q_m. The
code for CONVEXHULL is not used in this execution. Indeed, the
algorithm performed is dramatically different than CONVEXHULL.
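The second execution for the convex hull problem is a set of independent checks on the trail, enumerated in the text that follows. As a hedged illustration, those checks might be coded as below. The names are hypothetical, indices are 0-based positions in the input list, the triangle test is made strict on the strength of the general position assumption, and the index check is run first only so that later lookups are safe.

```python
class TrailError(Exception):
    """A check failed; corresponds to halting and outputting "error"."""


def cross(o, a, b):
    """Twice the signed area of o, a, b; positive for a counterclockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def strictly_inside(x, a, b, c):
    """True when x is strictly inside triangle a, b, c (any orientation)."""
    d = (cross(a, b, x), cross(b, c, x), cross(c, a, x))
    return all(v > 0 for v in d) or all(v < 0 for v in d)


def check_hull_trail(points, trail, hull):
    """Return hull if all five checks pass, otherwise raise TrailError."""
    n, m = len(points), len(hull)

    def valid(i):
        return isinstance(i, int) and 0 <= i < n

    # Fourth check: every index in the trail names an input point.
    if not all(len(t) == 4 and all(map(valid, t)) for t in trail):
        raise TrailError("four tuple refers to a non-input point")
    if not all(map(valid, hull)):
        raise TrailError("hull refers to a non-input point")
    # Third check: removed points plus hull points cover each input once.
    if sorted([t[0] for t in trail] + list(hull)) != list(range(n)):
        raise TrailError("no one to one correspondence with the input")
    # First check: each removed point lies within its triangle.
    for x, a, b, c in trail:
        if not strictly_inside(points[x], points[a], points[b], points[c]):
            raise TrailError("point is not within its triangle")
    # Second check: consecutive hull triples never turn clockwise.
    for j in range(m):
        prev, cur, nxt = hull[j - 1], hull[j], hull[(j + 1) % m]
        if cross(points[prev], points[cur], points[nxt]) < 0:
            raise TrailError("hull angle exceeds 180 degrees")
    # Fifth check: exactly one local extreme point in the y coordinate.
    tops = sum(
        1 for j in range(m)
        if points[hull[j - 1]][1] < points[hull[j]][1]
        >= points[hull[(j + 1) % m]][1])
    if tops != 1:
        raise TrailError("local extreme point is not unique")
    return hull
```

None of the checks sorts the input, which is why this execution can avoid the O(n log(n)) cost of the scan.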

The second execution consists of five checks on the trail data.

First, the algorithm checks for i in {1, ..., r} that x_i lies
within the triangle defined by a_i, b_i, and c_i.

Second, the algorithm checks that for each triple of
counterclockwise consecutive points on the supposed convex hull
the angle formed by the points is less than or equal to 180
degrees.

Third, it checks that there is a one to one correspondence
between the input points and the points in
{x_1, ..., x_r} U {q_1, ..., q_m}.

Fourth, it checks that for i in {1, ..., r}, a_i, b_i, and c_i
are among the input points.

Fifth, it checks that there is a unique point among the points on
the supposed convex hull which is a local extreme point. A point
q on the hull is a local extreme point if its predecessor in the
counterclockwise ordering has a strictly smaller y coordinate and
its successor in the ordering has a smaller or equal y
coordinate.

If any of these checks fail then execution halts and "error" is
output. As mentioned above, the trail data actually consists of
indices into the input data. This does not unduly complicate the
checks above; instead it makes them easier. The correctness and
adequacy of these checks must be proven.

Time Complexity of the Two Executions

In the first execution the sorting of the input points takes
O(n log(n)) time where n is the number of input points. One can
show that this cost dominates and the overall complexity is
O(n log(n)).

It is worth noting that, unlike the minimum spanning tree example
and the Huffman tree example, the convex hull example utilizes an
algorithm in the second execution that is not a close variant of
that used in the first execution. However, like the previous two
examples, the second execution for the convex hull problem
depends fundamentally on the information in the certification
trail for efficiency and performance.

Concurrency of Executions

In the three examples discussed above, it is possible to start
the second execution before the first execution has terminated.
This is a highly desirable capability when additional hardware is
available to run the second execution (for example, with
multiprocessor machines, or machines with coprocessors or
hardware monitors).

In the case of the minimum spanning tree problem, the two
executions can be run concurrently. It is only necessary for the
second execution to read the certification trail as it is
generated, one item number at a time. Thus, there is a slight
time lag in the second execution. The case of the Huffman tree
problem is similar. Both executions can be run concurrently if
the second execution reads the certification trail as it is
generated by the first execution.

The case of the convex hull problem is not quite as favorable,
but it is still possible to partially overlap the two executions.
For example, as each 4-tuple of points is generated by the first
execution, it can be checked by the second execution. But the
second execution must wait for the points on the convex hull to
be output at the end of the first execution before they can be
checked.

An additional opportunity for overlapping execution occurs when
the system has a dedicated comparator. In this case it is
sometimes possible for the two executions to send their output to
the comparator as they generate it. For example, this can be done
in the minimum spanning tree problem where the edges of the tree
can be sent individually as they are discovered by both
executions.

Comparison of Techniques

The certification trail approach to fault tolerance, whether
implemented in hardware or software or some combination thereof,
has resemblances with other fault tolerant techniques that have
been previously proposed and examined, but in each case there are
significant and fundamental distinctions. These distinctions are
primarily related to the generation and character of the
certification trail and the manner in which the secondary
algorithm or system uses the certification trail to indicate
whether the execution of the primary system or algorithm was in
error and/or to produce an output to be compared with that of the
primary system.

To begin, the certification trail approach might be viewed as a
form of N-version programming [Chen, L., and Avizienis A.,
"N-version programming: a fault tolerant approach to reliability
of software operation," Digest of the 1978 Fault Tolerant
Computing Symposium, pp. 3-9, IEEE Computer Society Press, 1978;
Avizienis, A., and Kelly J., "Fault tolerance by design
diversity: concepts and experiments," Computer, vol. 17, pp.
67-80, August, 1984]. This approach specifies that N different
implementations of an algorithm be independently executed with
subsequent comparison of the resulting N outputs. There is no
relationship among the executions of the different versions of
the algorithms other than they all use the same input; each
algorithm is executed independently without any information about
the execution of the other algorithms. In marked contrast, the
certification trail approach allows the primary system to
generate a trail of information while executing its algorithm
that is critical to the secondary system's execution of its
algorithm. In effect, N-version programming can be thought of
relative to the certification trail approach as the employment of
a null trail.

A software/hardware fault tolerance technique known as the
recovery block approach [Randell, B., "System structure for
software fault tolerance," IEEE Trans. on Software Engineering,
vol. 1, pp. 202-232, June, 1975; Anderson, T., and Lee, P., Fault
tolerance: principles and practices, Prentice-Hall, Englewood
Cliffs, N.J., 1981; Lee, Y. H. and Shin, K. G., "Design and
evaluation of a fault-tolerant multiprocessor using hardware
recovery blocks," IEEE Trans. Comput., vol. C-33, pp. 113-124,
February 1984] uses acceptance tests and alternative procedures
to produce what is to be regarded as a correct output from a
program. When using recovery blocks, a program is viewed as being
structured into blocks of operations which after execution yield
outputs which can be tested in some informal sense for
correctness. The rigor, completeness, and nature of the
acceptance test is left to the program designer, and many of the
acceptance tests that have been proposed for use tend to be
somewhat straightforward [Anderson, T., and Lee, P., Fault
tolerance: principles and practices, Prentice-Hall, Englewood
Cliffs, N.J., 1981]. Indeed, formal methodologies for the
definition and generation of acceptance tests have thus far not
been established. Regardless, the certification trail notion of a
secondary system that receives the same input as the primary
system and executes an algorithm that takes advantage of this
trail to efficiently produce the correct output and/or to
indicate that the execution of the first algorithm was correct
does not fall into the category of an acceptance test.

A watchdog processor is a small and simple (relative to the
primary system being monitored) hardware monitor that detects
errors by examining information relative to the behavior of the
primary system [Mahmood, A., and McCluskey, E., "Concurrent error
detection using watchdog processors," IEEE Trans. on Computers,
vol. 37, pp. 160-174, February, 1988; Mahmood, A., and McCluskey,
E., "Concurrent error detection using watchdog processors - a
survey," IEEE Trans. on Computers, vol. 37, pp. 160-174,
February, 1988; Namjoo, M., and McCluskey, E., "Watchdog
processors and capability checking," Digest of the 1982 Fault
Tolerant Computing Symposium, pp. 245-248, IEEE Computer Society
Press, 1982]. Error detection using a watchdog processor is a
two-phase process: in the set-up phase, information about system
behavior is provided a priori to the watchdog processor about the
system to be monitored; in the monitoring phase, the watchdog
processor collects or is sent information about the operation of
the system to be compared with that which was provided during the
set-up phase. On the basis of this comparison, a decision is made
by the watchdog processor as to whether or not an error has
occurred. The information about system behavior by means of which
a watchdog processor must monitor for errors includes memory
access behavior [Namjoo, M., and McCluskey, E., "Watchdog
processors and capability checking," Digest of the 1982 Fault
Tolerant Computing Symposium, pp. 245-248, IEEE Computer Society
Press, 1982], control and program flow [Eifert, J. B. and Shen,
J. P., "Processor monitoring using asynchronous signatured
instruction streams," Dig. 14th Int. Conf. Fault-Tolerant
Comput., pp. 394-399, 1984, June 20-22; Iyengar, V. S. and
Kinney, L. L., "Concurrent fault detection in microprogrammed
control units," IEEE Trans. Comput., vol. C-34, pp. 810-821,
September 1985; Kane, J. R. and Yau, S. S., "Concurrent software
fault detection," IEEE Trans. Software Eng., vol. SE-1, pp.
87-99, March 1975; Lu, D., "Watchdog processor and structural
integrity checking," IEEE Trans. Comput., vol. C-31, pp. 681-685,
July 1982; Namjoo, M., "Techniques for concurrent testing of VLSI
processor operation," Dig. 1982 Int. Test Conf., pp. 461-468,
November 1982; Namjoo, M., "CERBERUS-16: An architecture for a
general purpose watchdog processor," Dig. Papers 13th Annu. Int.
Symp. Fault Tolerant Comput., pp. 216-219, June, 1983; Shen, J.
P. and Schuette, M. A., "On-line self-monitoring using signatured
instruction streams," Proc. 1983 Int. Test Conf., pp. 275-282,
October, 1983; Sridhar, T. and Thatte, S. M., "Concurrent
checking of program flow in VLSI processors," Dig. 1982 Int. Test
Conf., pp. 191-199, November, 1982; 46, 47], or reasonableness of
results [Mahmood, A., Lu, D. J. and McCluskey, E. J., "Concurrent
fault detection using a watchdog processor and assertions," Proc.
1983 Int. Test Conf., pp. 622-628, October, 1983; Mahmood, A.,
Ersoz, A. and McCluskey, E. J., "Concurrent system level error
detection using a watchdog processor," Proc. 1985 Int. Test
Conf., pp. 145-152, November, 1985]. Using physical fault
injection techniques, distributions of errors that could be
detected using such types of information have been determined for
some specific systems [Schmid, M., Trapp, R., Davidoff, A., and
Masson, G., "Upset exposure by means of abstraction
verification," Dig. of the 1982 Fault Tolerant Computing
Symposium, pp. 237-244, June, 1982; Gunneflo, U., Karlsson, J.,
and Torin, J., "Evaluation of error detection schemes using fault
injection by heavy-ion radiation," Dig. of the 1989 Fault
Tolerant Computing Symposium, pp. 340-347, June, 1989], and the
performance of models of error monitoring techniques that could
be realized in the form of watchdog processors has been analyzed
[Blough, D., and Masson, G., "Performance analysis of a
generalized concurrent error detection procedure," IEEE Trans. on
Computers, vol. 39, January, 1990]. However, in contrast to the
certification trail technique, a watchdog processor uses only a
priori defined behavior checks, none of which is sufficient
together with the input to the primary system to efficiently
reproduce the output for direct comparison with that of the
primary system.

Related to the watchdog processor approach is that of using
executable assertions [Andrews, D., "Software fault tolerance
through executable assertions," Rec. 12th Asilomar Conf.
Circuits, Syst., Comput., 1978, November 6-8; Andrews, D., "Using
executable assertions for testing and fault tolerance," Dig. 9th
Annu. Int. Symp. Fault-Tolerant Comput., pp. 102-105, 1979, June
20-22; Mahmood, A., Lu, D. J. and McCluskey E. J., "Concurrent
fault detection using a watchdog processor and assertions," Proc.
1983 Int. Test Conf., pp. 622-628, October 1983]. An assertion
can be defined as an invariant relationship among variables of a
process. In a program, for example, assertions can be written as
logical statements and can be inserted into the code to signify
that which has been predetermined to be invariably true at that
point in the execution of the program. Assertions are based on a
priori determined properties of the primary system or algorithm.
This, however, again serves to distinguish the executable
assertion technique from the use of certification trails in that
a certification trail is a key to the solution of a problem or
the execution of an algorithm that can be utilized to efficiently
and correctly produce the solution.

Algorithm-based fault tolerance [Huang, K.-H., and Abraham, J.,
"Algorithm-based fault tolerance for matrix operations," IEEE
Trans. on Computers, pp. 518-529, vol. C-33, June, 1984; Nair,
V., and Abraham, J., "General linear codes for fault-tolerant
matrix operations on processor arrays," Dig. of the 1988 Fault
Tolerant Computing Symposium, pp. 180-185, June, 1988; Jou, J.,
and Abraham, J., "Fault tolerant FFT networks," Dig. of the 1985
Fault Tolerant Computing Symposium, June, 1985] uses error
detecting and correcting codes for performing reliable
computations with specific algorithms. This technique encodes
data at a high level and algorithms are specifically designed or
modified to operate on encoded data and produce encoded output
data. Algorithm-based fault tolerance is distinguished from other
fault tolerance techniques by three characteristics: the encoding
of the data used by the algorithm; the modification of the
algorithm to operate on the encoded data; and the distribution of
the computation steps in the algorithm among computational units.
It is assumed that at most one computational unit is faulty
during a specified time period. The error detection capabilities
of the algorithm-based fault tolerance approach are directly
related to that of the error correction encoding utilized. The
certification trail approach does not require that the data to be
executed be modified nor that the fundamental operations of the
algorithm be changed to account for these modifications. Instead,
only a trail indicative of aspects of the algorithm's operations
must be generated by the algorithm. As seen from the above
examples, the production of this trail does not burden the
algorithm with a significant overhead. Moreover, any combination
of computational errors can be handled.

Recently Blum and Kannan [Blum, M., and Kannan, S., "Designing
programs that check their work," Proceedings of the 1989 ACM
Symposium on Theory of Computing, pp. 86-97, ACM Press, 1989]
have defined what they call a program checker. A program
checker is an algorithm which checks the output of another
algorithm for correctness and thus it is similar to an
acceptance test in a recovery block. An example of a program
checker is the algorithm developed by Tarjan [Tarjan, R. E.,
"Applications of path compression on balanced trees," J. ACM,
pp. 690-715, October, 1979] which takes as input a graph and a
supposed minimum spanning tree and indicates whether or not
the tree actually is a minimum spanning tree. The Blum and
Kannan checker is actually more general than this because it
is allowed to be probabilistic in a carefully specified way.
There are two main differences between this approach and the
certification trail approach. First, a program checker may
call the algorithm it is checking a polynomial number of
times. In the certification trail approach the algorithm being
checked is run once. Second, the checker is designed to work
for a problem and not a specific algorithm. That is, the
checker design is based on the input/output specification of a
problem. The certification trail approach is explicitly
algorithm oriented. In other words, a specific algorithm for a
problem is modified to output a certification trail. This
trail sometimes allows the second execution to be faster than
any known program checkers for the problem. This is the case
for the minimum spanning tree problem.

Other hardware and software fault tolerance and error
monitoring techniques have been proposed and studied that
might be thought of as bearing some resemblance to the
certification trail approach. Extensive summaries and
descriptions of these techniques can be found in the
literature [Siewiorek, D., and Swarz, R., The theory and
practice of reliable design, Digital Press, Bedford, Mass.,
1982; Avizienis, A., "Fault tolerance by means of external
monitoring of computer systems," Proceedings of the 1981
National Computer Conference, pp. 27-40, AFIPS Press, 1980;
Johnson, B., Design and analysis of fault tolerant digital
systems, Addison-Wesley, Reading, Mass., 1989; Mahmood, A.,
and McCluskey, E., "Concurrent error detection using watchdog
processors - a survey," IEEE Trans. on Computers, vol. 37,
pp. 160-174, February, 1988]. Examination of these techniques
reveals, however, that in each case there are fundamental
distinctions from the certification trail approach. In
summary, the certification trail approach stands alone in its
employment of secondary algorithms/systems for the computation
of an output for comparison that, because of the availability
of the trail, not only proceeds in a more efficient manner
than that of the primary but also can indicate whether the
execution of the primary algorithm was correct.

Although the invention has been described in detail in the
foregoing embodiments for the purpose of illustration, it is
to be understood that such detail is solely for that purpose
and that variations can be made therein by those skilled in
the art without departing from the spirit and scope of the
invention except as it may be described by the following
claims.

What is claimed is:

1. A method for achieving fault tolerance in a computer
system having at least a first central processing unit and a
second central processing unit comprising the steps of:

executing a first algorithm in the first central processing
unit on input so that a first output and a certification
trail are produced;

executing a second algorithm in the second central
processing unit on the input and on the certification trail
so that a second output is produced, said second algorithm
having a faster execution time than the first algorithm for
a given input; and

comparing the first and second outputs such that an error
result is produced if the first and second outputs are not
the same.




2. A method as described in claim 1 wherein the step of
executing the second algorithm includes the step of
determining whether the certification trail is in error.

3. A method as described in claim 2 including before the step
of executing the first algorithm, there is the step of
duplicating the input such that the input that is provided to
the step of executing the first algorithm is also the input
that is provided to the step of executing the second
algorithm.

4. A method as described in claim 3 wherein the step of
executing the first algorithm includes the step of
determining whether the first output is in error.

5. A method as described in claim 4 wherein the step of
executing the first algorithm includes the step of
determining whether the second output is in error.

6. A method as described in claim 5 wherein the second
algorithm generates the second output correctly when the
second algorithm is executed by the second processing unit
even if the certification trail produced by the first
algorithm when the first algorithm is executed by the first
processing unit is incorrect.

7. A method as described in claim 1 wherein the second
algorithm is derived from the first algorithm.

8. A computer system comprising:
a first computer comprising:
a first memory,

a first central processing unit in communication with the
memory,

a first input port in communication with the memory and the
first central processing unit,

a first algorithm disposed in the first memory, said first
algorithm produces a first output and produces a
certification trail based on input received by the input
port when the first algorithm is executed by the first
central processor;

a second computer comprising a second memory,

a second central processing unit in communication with the
second memory and the first central processing unit;

a second input port in communication with the second memory
and the second central processing unit;

a second algorithm disposed in the second memory, said
second algorithm produces a second output based on the input
and the certification trail when the second algorithm is
executed by the second central processing unit, said second
algorithm having a faster execution time than the first
algorithm for a given input; and

a mechanism for comparing the first and second outputs such
that an error result is produced if the first and second
outputs are not the same.

9. A computer as described in claim 8 wherein the second
algorithm generates the second output correctly when the
second algorithm is executed by the second processing unit
even if the certification trail produced by the first
algorithm when the first algorithm is executed by the first
processing unit is incorrect.

10. A computer system as described in claim 9 wherein the
mechanism for comparing is a comparator.

11. An apparatus as described in claim 10 wherein the second
algorithm is derived from the first algorithm.

12. A method for achieving fault tolerance in a central
processing unit comprising the steps of:

executing a first algorithm in the central processing unit
on input so that a first output and a certification trail
are produced;

executing a second algorithm in the central processing unit
on the input and on the certification trail so that a second
output is produced, said second algorithm having a faster
execution time than the first algorithm for a given input;
and

comparing the first and second outputs such that an error
result is produced if the first and second outputs are not
the same.

13. A method as described in claim 12 wherein the second
algorithm generates the second output correctly when the
second algorithm is executed by the processing unit even if
the certification trail produced by the first algorithm when
it is executed by the processing unit is incorrect.

14. A method as described in claim 13 wherein the second
algorithm is derived from the first algorithm.

15. A computer comprising:
a memory,

a central processing unit in communication with the memory,

a first input port in communication with the memory and the
central processing unit,

a first algorithm disposed in the memory, said first
algorithm produces a first output and a certification trail
based on input received by the input port when the input is
executed by the central processing unit;

a second algorithm disposed in the memory, said second
algorithm produces a second output based on the input and on
at least a portion of the certification trail when the
second algorithm is executed by the central processing unit,
said second algorithm having a faster execution time than
the first algorithm for a given input; and

a mechanism for comparing the first and second outputs such
that an error result is produced if the first and second
outputs are not the same.

16. A computer as described in claim 15 wherein the second
algorithm generates the second output correctly when the
second algorithm is executed by the processing unit even if
the certification trail produced by the first algorithm when
the first algorithm is executed by the processing unit is
incorrect.

17. A computer as described in claim 16 wherein the mechanism
for comparing is a comparator.

18. An apparatus as described in claim 15 wherein the second
algorithm is derived from the first algorithm.

• • • • • 





The Twentieth International Symposium on
Fault-Tolerant Computing (1990)


Using Certification Trails to Achieve Software Fault Tolerance

Gregory F. Sullivan*

Gerald M. Masson†

Dept. of Computer Science, Johns Hopkins Univ., Baltimore, MD 21218


Abstract 

We introduce a conceptually novel and powerful technique to
achieve fault tolerance in hardware and software systems. When
used for software fault tolerance, this new technique uses
time and software redundancy and can be outlined as follows.
In the initial phase, a program is run to solve a problem and
store the result. In addition, this program leaves behind a
trail of data which we call a certification trail. In the
second phase, another program is run which solves the original
problem again. This program, however, has access to the
certification trail left by the first program. Because of the
availability of the certification trail, the second phase can
be performed by a less complex program and can execute more
quickly. In the final phase, the two results are compared and
if they agree the results are accepted as correct; otherwise
an error is indicated. An essential aspect of this approach is
that the second program must always generate either an error
indication or a correct output even when the certification
trail it receives from the first program is incorrect. We
formalize the certification trail approach to fault tolerance
and illustrate it by applying it to the fundamental problem of
finding a minimum spanning tree. We discuss cases in which the
second phase can be run concurrently with the first and act as
a monitor. We compare the certification trail approach to
other approaches to fault tolerance. Because of space
limitations we have omitted examples of our technique applied
to the Huffman tree and convex hull problems. These can be
found in the full version of this paper.

1 Introduction 

In this paper we introduce a novel and powerful technique for
achieving fault tolerance in systems. Although the technique
is applicable to both hardware and software, we restrict our
discussion in the following to software fault tolerance. To
explain our new

* Research partially supported by NSF Grants CCR-8910569 and
CCR-8908092.

† Research partially supported by NASA Grant NSG 1442.


technique for software fault tolerance, we will first dis- 
cuss a simpler fault tolerant software method. In this 
method the specification of a problem is given and an 
algorithm to solve it is constructed. This algorithm is 
executed on an input and the output is stored. Next, 
the same algorithm is executed again on the same in- 
put and the output is compared to the earlier output. 
If the outputs differ then an error is indicated; otherwise
the output is accepted as correct. This software fault
tolerance method requires additional time, so-called time
redundancy [14, 22]; however, it requires
no additional software. It is particularly valuable for 
detecting errors caused by transient fault phenomena. 
If such faults cause an error during only one of the ex- 
ecutions then either the error will be detected or the 
output will be correct. 

A variation of the above method uses two separate 
algorithms, one for each execution, which have been 
written independently based on the problem specification.
This technique, called N-version programming [8, 4] (in this
case N=2), allows for the detection of errors
caused by some faults in the software in addition to 
those caused by transient hardware faults and utilizes 
both time and software redundancy. Errors caused 
by software faults are detected whenever the indepen- 
dently written programs do not generate coincident 
errors. 

The technique we will describe is designed to achieve 
similar types of error detection capabilities but expend 
fewer resources. The central idea, as illustrated in Fig- 
ure 1, is to modify the first algorithm so that it leaves 
behind a trail of data which we call a certification trail. 
This data is chosen so that it can allow the second algorithm
to execute more quickly and/or have a simpler structure than
the first algorithm. As above, the outputs of the two
executions are compared and are considered correct only if
they agree. However, we must be careful in defining this
method or else its error detection capability might be reduced
by the introduction of data dependency between the two
algorithm executions. For example, suppose the first algorithm
execution contains an error which causes an incorrect output
and an incorrect trail of data to



CH2877-9/90/0000/0423$01.00 © 1990 IEEE




Figure 1: Certification trail method. 


be generated. Further suppose that no error occurs during the
execution of the second algorithm. It still appears possible
that the execution of the second algorithm might use the
incorrect trail to generate an incorrect output which matches
the incorrect output given by the execution of the first
algorithm. Intuitively, the second execution would be "fooled"
by the data left behind by the first execution. The
definitions we give below exclude this possibility. They
demand that the second execution either generates a correct
answer or signals the fact that an error has been detected in
the data trail. Finally, it should be noted that in Figure 1
both executions can signal an error. These errors would
include run-time errors such as divide-by-zero or
non-terminating computation. In addition the second execution
can signal error due to an incorrect certification trail.


2 Formal Definition of a Certification Trail

In this section we will give a formal definition of a 
certification trail and discuss some aspects of its real- 
izations and uses. 

Definition 2.1 A problem $P$ is formalized as a relation (that
is, a set of ordered pairs). Let $D$ be the domain (that is,
the set of inputs) of the relation $P$ and let $S$ be the
range (that is, the set of solutions) for the problem. We say
an algorithm $A$ solves a problem $P$ iff for all $d \in D$
when $d$ is input to $A$ then an $s \in S$ is output such that
$(d, s) \in P$.

Definition 2.2 Let $P : D \to S$ be a problem. Let $T$ be the
set of certification trails. A solution to this problem using
a certification trail consists of two functions $F_1$ and
$F_2$ with the following domains and ranges:
$F_1 : D \to S \times T$ and
$F_2 : D \times T \to S \cup \{\mathrm{error}\}$. The
functions must satisfy the following two properties:

(1) for all $d \in D$ there exists $s \in S$ and there exists
$t \in T$ such that $F_1(d) = (s, t)$ and $F_2(d, t) = s$ and
$(d, s) \in P$

(2) for all $d \in D$ and for all $t \in T$ either
($F_2(d, t) = s$ and $(d, s) \in P$) or
$F_2(d, t) = \mathrm{error}$.


The definitions above assure that the error detection
capability of the certification trail approach is comparable
to that obtained with the simple time redundancy approach
discussed earlier. That is, if transient hardware faults occur
during only one of the executions then either an error will be
detected or the output will be correct. It should be further
noted, however, that the examples to be considered will
indicate that this new approach can also save overall
execution time.
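
As an illustration of Definition 2.2, the following is a minimal
Python sketch under stated assumptions. The example problem
(sorting a list) is not taken from this section, and the names
ERROR, f1_sort, f2_sort and certified_run are hypothetical. The
trail is the permutation that sorts the input; the second
execution applies it in linear time and checks it, so that for
any trail it returns either the correct output or "error", as
property (2) requires.

```python
ERROR = "error"

def f1_sort(d):
    """First execution F1: sort d and leave the sorting permutation as trail."""
    trail = sorted(range(len(d)), key=lambda i: d[i])
    return [d[i] for i in trail], trail

def f2_sort(d, trail):
    """Second execution F2: use the trail, but never trust it blindly."""
    n = len(d)
    if len(trail) != n:
        return ERROR
    seen = [False] * n
    for i in trail:
        # The trail must be a permutation of the index set 0..n-1.
        if not isinstance(i, int) or not 0 <= i < n or seen[i]:
            return ERROR
        seen[i] = True
    out = [d[i] for i in trail]
    # A permutation of d that is nondecreasing is the sorted d.
    if any(out[j] > out[j + 1] for j in range(n - 1)):
        return ERROR
    return out

def certified_run(f1, f2, d):
    """Run both phases and compare; accept only if the outputs agree."""
    s1, trail = f1(d)
    s2 = f2(d, trail)
    return s1 if s2 != ERROR and s1 == s2 else ERROR
```

For example, certified_run(f1_sort, f2_sort, [3, 1, 2]) accepts
the sorted list, while f2_sort([3, 1, 2], [0, 1, 2]) signals
"error" because the supplied trail does not sort the input.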

The certification trail approach also allows for the detection
of faults in software. As in 2-version programming, separate
teams can write the algorithms for the first and second
executions. Note that the specification now must include
precise information describing the generation and use of the
certification trail. Because of the additional data available
to the second execution, the specifications of the two phases
can be very different; similarly, the two algorithms used to
implement the phases can be very different. This is
illustrated by the convex hull example in the full paper.
Alternatively, the two algorithms can be very similar,
differing only in data structure manipulations. This is
illustrated by the minimum spanning tree example considered
later. When significantly different algorithms are used, the
probability that both algorithms will contain or be affected
by faults which generate matching errors should be reduced.
When very similar algorithms are used it is sometimes possible
to save programming effort by sharing program code. While this
reduces the ability to detect errors in the software it does
not change the ability to detect transient hardware errors as
discussed earlier.

Throughout this section we have assumed that our method is
implemented with software; however, it is clearly possible to
implement the certification trail technique by using dedicated
hardware. It is also possible to generalize the basic
two-level hierarchy of the certification trail approach as
illustrated in Figure 1 to higher levels. Finally, we note
that a wide variety of approaches to software and hardware
fault tolerance have been proposed which bear resemblances to
the certification trail approach; we contrast our method to
the most closely related ideas. A more comprehensive
comparison appears in the full paper.

3 Minimum Spanning Tree Example

In this section we illustrate the use of the certification 
trail method by applying it to the minimum spanning 
tree problem. Because of space limitations we have 
omitted other applications, e.g., to the Huffman tree
and the convex hull problems. It should be stressed 
here that we believe the technique has wide applica- 
bility and these problems were chosen simply for illus- 
tration. 

The minimum spanning tree problem has been ex- 
amined extensively in the literature and an historical 
survey is given in [11]. Our certification trail approach 
is applied to a variant of the Prim/Dijkstra algorithm
[19, 9] as explicated in [24]. We will begin our discussion of
the application of the certification trail approach to the
minimum spanning tree problem with
some preliminary definitions. 

Definition 3.1 A graph $G = (V, E)$ consists of a vertex set
$V$ and an edge set $E$. An edge is an unordered pair of
distinct vertices which we notate as, for example, $[v, w]$,
and we say $v$ is adjacent to $w$. A path in a graph from
$v_1$ to $v_k$ is a sequence of vertices
$v_1, v_2, \ldots, v_k$ such that $[v_i, v_{i+1}]$ is an edge
for $i \in \{1, \ldots, k-1\}$. A path is a cycle if $k > 1$
and $v_1 = v_k$. An acyclic graph is a graph which contains no
cycles. A connected graph is a graph such that for all pairs
of vertices $v, w$ there is a path from $v$ to $w$. A tree is
an acyclic and connected graph.

Definition 3.2 Let $G = (V, E)$ be a graph and let $w$ be a
positive rational valued function defined on $E$. A subtree of
$G$ is a tree, $T(V', E')$, with $V' \subseteq V$ and
$E' \subseteq E$. We say $T$ spans $V'$ and $V'$ is spanned by
$T$. If $V' = V$ then we say $T$ is a spanning tree of $G$.
The weight of this tree is $\sum_{e \in E'} w(e)$. A minimum
spanning tree is a spanning tree of minimum weight.


3.0.1 Data structures and supported operations

Before we discuss the minimum spanning tree algorithm, we must
describe the properties of the principal data structure that
are required. Since many different data structures can be used
to implement the algorithm, we initially describe abstractly
the data that can be stored by the data structure and the
operations that can be used to manipulate this data. The data
consists of a set of ordered pairs. The first element in these
ordered pairs is referred to as the item number and the second
element is called the key value. Ordered pairs may be added
and removed from the set; however, at all times, the item
numbers of distinct ordered pairs must be distinct. It is
possible, though, for multiple ordered pairs to have the same
key value. In this paper the item numbers are integers between
1 and n, inclusive. Our default convention is that i is an
item number, k is a key value and h is a set of ordered pairs.
A total ordering on the pairs of a set can be defined
lexicographically as follows: $(i, k) < (i', k')$ iff
$k < k'$ or ($k = k'$ and $i < i'$). Our data structure should
support a subset of the following operations.

member(i, h) returns a boolean value of true if h contains an
ordered pair with item number i, otherwise returns false.

insert(i, k, h) adds the ordered pair (i, k) to the set h.

delete(i, h) deletes the unique ordered pair with item number
i from h.

changekey(i, k, h) is executed only when there is an ordered
pair with item number i in h. This pair is replaced by (i, k).

deletemin(h) returns the ordered pair which is smallest
according to the total order defined above and deletes this
pair. If h is the empty set then the token "empty" is
returned.

predecessor(i, h) returns the item number of the ordered pair
which immediately precedes the pair with item number i in the
total order. If there is no predecessor then the token
"smallest" is returned.
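
The total order and the predecessor operation can be made
concrete with a small Python sketch. It assumes that h is held
as a plain set of (item, key) tuples and re-sorts on every call,
which is far slower than the structures discussed in this
section; the names order_key and predecessor are illustrative
only.

```python
def order_key(pair):
    """Sort key for the total order: by key value, ties broken by item number."""
    item, key = pair
    return (key, item)

def predecessor(i, h):
    """Item number of the pair just before item i, or the token "smallest"."""
    ordered = sorted(h, key=order_key)
    items = [item for item, _ in ordered]
    pos = items.index(i)  # raises ValueError if item i is not in h
    return items[pos - 1] if pos > 0 else "smallest"

h = {(6, 450), (7, 505), (3, 800)}
print(predecessor(7, h))  # 6
print(predecessor(6, h))  # smallest
```

Because ties in key value are broken by item number, the pair
(1, 5) precedes (2, 5) even though both carry the same key.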

Many different types and combinations of data structures can
be used to support these operations efficiently. In our case,
we will actually use two different data structure methods to
support these operations. One method will be used in the first
execution of the algorithm and another, faster and simpler,
method will be used in the second execution. The second method
relies on a trail of data which is output by the first
execution.

3.0.2 MINSPAN algorithm

Before discussing precise implementation details for these
methods we present the overall algorithm used in both
executions. Pidgin code for this algorithm appears below. In
addition, Figure 2 illustrates the execution of the algorithm
on a sample graph and the table below records the data
structure operations the algorithm must perform when run on
the sample graph. The first column of the table gives the
operations except member and with the parameter h dropped to
reduce clutter. The second column gives the evolving contents
of h. The third column records the ordered pair deleted by the
deletemin operation. The fourth column records the
certification trail corresponding to these operations and is
further discussed below.

The algorithm uses a "greedy" method to "grow" a minimum
spanning tree. The algorithm starts by choosing an arbitrary
vertex from which to grow the tree. During each iteration of
the algorithm a new edge is added to the tree being
constructed. Thus, the set of vertices spanned by the tree
increases by exactly one vertex for each iteration. The edge
which is added to the tree is the one with the smallest
weight. Figure 2 shows this process in action. Figure 2(a)
shows the input graph, Figures 2(b) through 2(e) show several
stages of the tree growth and Figure 2(f) shows the final
output of the minimum spanning tree. The solid edges in
Figures 2(b) through 2(e) represent the current tree and the
dotted edges represent candidates for addition to the tree.

To efficiently find the edge to add to the current tree the
algorithm uses the data structure operations described above.
As soon as a vertex, say v, is adjacent to some vertex which
is currently spanned it is inserted in the set h. The key
value for v is the weight of the minimum weight edge between v
and some vertex spanned by the current tree. The array element
prefer(v) is used to keep track of this minimum weight edge.
As the tree grows, information is updated by operations such
as insert(i, k, h) and changekey(i, k, h). The deletemin(h)
operation is used to select the next vertex to add to the span
of the current tree. Note, the algorithm does not explicitly
keep a set of edges representing the current tree. Implicitly,
however, if (v, k) is returned by deletemin then prefer(v) is
added to the current tree.

Figure 2: Example for minimum spanning tree algorithm.

3.0.3 First execution of MINSPAN

In the first execution of the algorithm, the MINSPAN code is
used and the principal data structure is implemented with a
balanced search tree such as an AVL tree [1], a red-black tree
[12], or a B-tree [5]. In addition, an array of pointers
indexed from 1 to n is used. The balanced search tree stores
the ordered pairs in h and is based on the total order
described earlier. The array of pointers is initially all nil.
For each item i, the ith pointer of the array is used to point
to the location of the ordered pair with item number i in the
balanced search tree. If there is no such ordered pair in the
tree then the ith pointer is nil. This array allows rapid
execution of operations such as member(i, h) and delete(i, h).


Algorithm MINSPAN(G, weight)

Input: Connected graph G = (V, E) where V = {1, ..., n}
       with edge weights.

Output: Spanning tree of G which has minimum weight

1  CHOOSE root ∈ V
2  FOR ALL u ∈ V, key(u) := ∞ END FOR
3  h := ∅; v := root
4  WHILE v ≠ empty DO
5      key(v) := -∞
6      FOR EACH [v, w] ∈ E DO
7          IF weight([v, w]) < key(w) THEN
8              key(w) := weight([v, w]); prefer(w) := [v, w]
9              IF member(w, h) THEN changekey(w, key(w), h)
10             ELSE insert(w, key(w), h) END IF
11         END IF
12     END FOR
13     (v, k) := deletemin(h)
14 END WHILE
15 FOR ALL u ∈ V - {root}, OUTPUT(prefer(u))

END MINSPAN

Figure 3: Code for MINSPAN Algorithm


Operation          Set of Ordered Pairs       Delete    Trail
insert(2,200)      (2,200)                              smallest
insert(6,500)      (2,200),(6,500)                      2
deletemin          (6,500)                    (2,200)
insert(3,800)      (6,500),(3,800)                      6
changekey(6,450)   (6,450),(3,800)                      smallest
insert(7,505)      (6,450),(7,505),(3,800)              6
deletemin          (7,505),(3,800)            (6,450)
insert(5,250)      (5,250),(7,505),(3,800)              smallest
changekey(7,495)   (5,250),(7,495),(3,800)              5
deletemin          (7,495),(3,800)            (5,250)
changekey(3,350)   (3,350),(7,495)                      smallest
insert(4,700)      (3,350),(7,495),(4,700)              7
deletemin          (7,495),(4,700)            (3,350)
changekey(4,650)   (7,495),(4,650)                      7
deletemin          (4,650)                    (7,495)
deletemin                                     (4,650)
deletemin                                     empty

Table 1: Data structure operations and certification
trail for MINSPAN

The certification trail is generated during the first
execution as follows: When CHOOSE root ∈ V is executed in the
first step, the vertex which is chosen is output. Also, each
time insert(i, k, h) or changekey(i, k, h) is executed,
predecessor(i, h) is executed afterwards, and the answer
returned is output. This is illustrated in the column labeled
"Trail" in the table above.
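
The first execution and its trail generation can be sketched in
Python as follows. This is a sketch under stated assumptions,
not the authors' implementation: a sorted Python list stands in
for the balanced search tree (so updates cost O(n) rather than
O(log n)), the function name minspan_first is hypothetical, and
the sample graph is a hypothetical one whose weights were chosen
to be consistent with the operations recorded in Table 1.

```python
import bisect
import math

def minspan_first(n, adj, root):
    """Run MINSPAN once; return (tree edges, certification trail).

    adj[v] is a list of (w, weight) pairs; vertices are numbered 1..n.
    """
    key = {u: math.inf for u in range(1, n + 1)}
    prefer = {}
    h = []          # sorted list of (key, item); stand-in for the balanced tree
    stored = {}     # item -> key value currently held in h
    trail = [root]  # the chosen root is the first trail entry

    v = root
    while v != "empty":
        key[v] = -math.inf
        for w, weight in adj[v]:
            if weight < key[w]:
                key[w] = weight
                prefer[w] = (v, w)
                if w in stored:                 # changekey: drop the old pair
                    h.remove((stored[w], w))
                stored[w] = weight
                pos = bisect.bisect_left(h, (weight, w))
                h.insert(pos, (weight, w))      # insert the (new) pair
                # predecessor(w, h), executed right after insert/changekey
                trail.append(h[pos - 1][1] if pos > 0 else "smallest")
        if h:                                   # deletemin
            _, v = h.pop(0)
            del stored[v]
        else:
            v = "empty"
    tree = [prefer[u] for u in range(1, n + 1) if u != root]
    return tree, trail

# Hypothetical graph, chosen to be consistent with the operations in Table 1.
EDGES = [(1, 2, 200), (1, 6, 500), (2, 3, 800), (2, 6, 450), (2, 7, 505),
         (6, 5, 250), (6, 7, 495), (5, 3, 350), (5, 4, 700), (3, 4, 650)]
ADJ = {u: [] for u in range(1, 8)}
for a, b, wt in EDGES:
    ADJ[a].append((b, wt))
    ADJ[b].append((a, wt))

tree, trail = minspan_first(7, ADJ, root=1)
print(trail)
```

With root 1 and these assumed weights, the printed trail is the
root followed by the entries of the "Trail" column of Table 1,
read from top to bottom.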

3.0.4 Second execution of MINSPAN 

The second execution of the algorithm also uses the MINSPAN
code; however, the CHOOSE construct and the data structure
operations are implemented differently than in the first
execution. The CHOOSE is performed by simply reading the first
element of the certification trail. This guarantees the same
choice of a starting vertex is made in both executions.
Figure 4 depicts the principal data structure used, which we
call an indexed linked list. The array is indexed from 1 to n
and contains pointers to a singly linked list which represents
the current contents of h. Each element in the list stores an
ordered pair in h except the head of the list which contains
the special ordered pair (0, -INF). The list is organized such
that a traversal from the head gives the sorted ordering of
the current contents of h from smallest to largest. The ith
element of the array points to the node containing the ordered
pair with the item number i if it is present in h; otherwise,
the pointer is nil. The 0th element of the array points to the
node containing (0, -INF). Initially, the array contains nil
pointers except the 0th element. We now show how to implement
the data structure operations.

To perform insert(i, k, h), it is necessary to read the next
value in the certification trail. This value, say j, is the
item number of the ordered pair which is the predecessor of
(i, k) in the current contents of h. A new linked list node is
allocated and the trail information is used to insert the node
into the data structure. Specifically, the jth array pointer
is traversed to a node in the linked list, say Y. (If j =
"smallest" then the 0th array pointer is traversed.) The new
node is inserted in the list just after node Y and before the
next node in the linked list (if there is one). The data field
in the new node is set to (i, k) and the ith pointer of the
array is set to point to the new node. Figure 4 shows the
insertion of (7,505) into the data structure given that the
certification trail value is 6. Figure 4(a) is before the
insertion and Figure 4(b) is after the insertion.

When the insert operation is performed, some checks must be
conducted. First, the ith array pointer must be nil before the
operation is performed. Second, the sorted order of the pairs
stored in the linked list must be preserved after the
operation. That is, if (i', k') is stored in the node before
(i, k) in the linked list and (i'', k'') is stored after
(i, k), then (i', k') < (i, k) < (i'', k'') must hold in the
total order. If either of these checks fails then execution
halts and "error" is output.

To perform delete(i, h) the ith array pointer is traversed and
the node found is deleted from the linked list. Next, the ith
array pointer is set to nil. Figure 4 shows the deletion of
item number 7 if one considers Figure 4(b) as depicting the
data structure before the operation and Figure 4(a) as
depicting it afterwards. When the delete operation is
performed one check is made. If the ith array pointer is nil
before the operation then the execution halts and "error" is
output.

To perform changekey(i, k, h) it suffices to perform
delete(i, h) followed by insert(i, k, h). Note, this means the
next item in the certification trail is read. Also, the checks
associated with both of these operations are performed and the
execution halts with "error" output if any check fails.

To perform deletemin(h) the 0th array pointer is
traversed to the head of the list and the next node
in the list is accessed. If there is no such node then
“empty” is returned and the operation is complete.
Otherwise, suppose the node is Y and suppose it contains
the ordered pair (i,k); then the node Y is deleted
from the list, the ith array pointer is set to nil, and
(i,k) is returned.

Lastly, to perform member(i,h) the ith array pointer
is examined. If it is nil then false is returned; otherwise,
true is returned. The predecessor(i,h) operation
is not used in the second execution.
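
The operations described above can be sketched in a few lines of
Python. This is only an illustrative sketch under stated
assumptions, not the authors' implementation: the names
TrailAssistedList and TrailError are hypothetical, the list is made
doubly linked so that a node can be removed in constant time, pairs
are assumed to be ordered by key first and item number second, and
a trail value that names no stored node is treated as an error.

```python
class TrailError(Exception):
    """Raised where the second execution would halt and output "error"."""


class _Node:
    def __init__(self, item, key):
        self.item, self.key = item, key
        self.prev = self.next = None


class TrailAssistedList:
    """Array of pointers plus a sorted linked list, driven by a trail."""

    def __init__(self, n, trail):
        self.head = _Node(0, None)        # list head, reached via pointer 0
        self.ptr = [None] * (n + 1)       # ptr[i] is None when item i is absent
        self.ptr[0] = self.head
        self.trail = iter(trail)

    def _check_item(self, item):
        if not (isinstance(item, int) and 1 <= item < len(self.ptr)):
            raise TrailError("bad item number")

    def _unlink(self, node):
        node.prev.next = node.next
        if node.next is not None:
            node.next.prev = node.prev
        self.ptr[node.item] = None

    def insert(self, item, key):
        self._check_item(item)
        if self.ptr[item] is not None:            # first check: pointer must be nil
            raise TrailError("item already present")
        j = next(self.trail, None)                # read the next trail item
        if j == "smallest":
            j = 0
        if not (isinstance(j, int) and 0 <= j < len(self.ptr)) or self.ptr[j] is None:
            raise TrailError("trail value names no node")   # assumed extra check
        y = self.ptr[j]
        nxt = y.next
        # second check: sorted order must hold around the new pair
        if y is not self.head and not (y.key, y.item) < (key, item):
            raise TrailError("order violated before new node")
        if nxt is not None and not (key, item) < (nxt.key, nxt.item):
            raise TrailError("order violated after new node")
        node = _Node(item, key)
        node.prev, node.next = y, nxt
        y.next = node
        if nxt is not None:
            nxt.prev = node
        self.ptr[item] = node

    def delete(self, item):
        self._check_item(item)
        if self.ptr[item] is None:
            raise TrailError("item not present")
        self._unlink(self.ptr[item])

    def change_key(self, item, key):
        self.delete(item)
        self.insert(item, key)                    # consumes one trail item

    def delete_min(self):
        node = self.head.next
        if node is None:
            return "empty"
        self._unlink(node)
        return (node.item, node.key)

    def member(self, item):
        self._check_item(item)
        return self.ptr[item] is not None
```

Every operation touches a constant number of nodes, which is the
property the time analysis of the second execution relies on. An
incorrect trail value either lands the new pair in a position that
passes both order checks, in which case the list is still sorted, or
raises TrailError.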

This completes the description of the second execution.
To show that what we have described is a correct
implementation of the certification trail method
requires a proof. The proof has several parts of varying
difficulty. First, one must show that if the first execution
is fault-free then it outputs a minimum spanning
tree. Second, one must show that if the first and second
executions are fault-free then they both output
the same minimum spanning tree. Both these parts of
the proof are not difficult to show.

Figure 4: Example of the data structure used in the
second execution of MINSPAN.

The third, more subtle part of the proof deals with
the situation in which only the second execution is
fault-free. This means an incorrect certification trail
may be generated in the first execution. In this case,
we must show that the second execution outputs either
the correct minimum spanning tree or “error”.
The checks that were described above have been carefully
designed to assure precisely this property by detecting
any errors that would prevent the execution
from generating the correct output. Because of space
restrictions we will not give the proof here.

3.0.5 Time complexity comparisons of the two 
executions 

In the first execution each data structure operation
can be performed in O(log(n)) time where |V| = n.
There are at most O(m) such operations and O(m)
additional time overhead where |E| = m. Thus, the
first execution can be performed in O(m log(n)) time. We
note that this algorithm does not achieve the fastest
known asymptotic time complexity which appears in
[10]. However, the algorithm we have presented has a
significantly smaller constant of proportionality which
makes it competitive for reasonably sized graphs. In
addition, it provides us with a relatively simple and
illustrative example of the use of a certification trail.
It should be mentioned that we have developed a more
complex certification trail solution for an asymptotically
faster minimum spanning tree algorithm which
uses Fibonacci heaps.




In the second execution each data structure operation
can be performed in O(1) time. There are still at
most O(m) such operations and O(m) additional time
overhead. Hence, the second execution can be performed
in O(m) time. In other words, because of the
availability of the certification trail, the second execution
is performed in linear time. There are no known
O(m) time algorithms for the minimum spanning tree
problem. Komlos was able to show that O(m) comparisons
suffice to find the minimum spanning tree.
However, there is no known O(m) time algorithm to
actually find and perform these comparisons. Even
the related “verification” problem has no known linear
time solution. In the verification problem the input
consists of an edge weighted graph and a subtree. The
output is “yes” if the subtree is the minimum spanning
tree and “no” otherwise. The best known algorithm
for this problem was created by Tarjan [25] and has
the nonlinear time complexity of O(m α(m, n)), where
α(m, n) is a functional inverse of Ackermann's function.
The fact that the data in a certification trail enables
a minimum spanning tree to be found in linear time
is, we believe, intriguing, significant, and indicative of
the great promise of the certification trail technique.

3.1 Concurrency of Executions

In some cases, it is possible to start the second execu- 
tion before the first execution has terminated. This is 
a highly desirable capability when additional hardware 
is available to run the second execution (for example, 
with multiprocessor machines, or machines with co- 
processors or hardware monitors). 

In the case of the minimum spanning tree problem,
the two executions can be run concurrently. It
is only necessary for the second execution to read the
certification trail as it is generated, one item number
at a time. Thus there is a slight time lag in the second
execution. This potential for concurrency has been
found in other problems we have examined, e.g., the
Huffman tree problem.
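
The idea of reading the trail one item at a time can be sketched
with two threads and a queue. The sketch below is a hedged
illustration only: it uses sorting as a stand-in problem rather than
the minimum spanning tree algorithm, and all names (first_execution,
second_execution, run_concurrently, END) are hypothetical. The first
execution emits, for each output element, the index it occupied in
the input; the second execution consumes each index as soon as it
arrives and checks it before extending its own output.

```python
import heapq
import queue
import threading

END = object()  # sentinel marking the end of the certification trail


def first_execution(d, trail_q, result):
    # Sort by repeated delete-min; each step emits one trail item.
    heap = [(x, i) for i, x in enumerate(d)]
    heapq.heapify(heap)
    out = []
    while heap:
        x, i = heapq.heappop(heap)
        out.append(x)
        trail_q.put(i)          # available to the second execution at once
    trail_q.put(END)
    result["first"] = out


def second_execution(d, trail_q, result):
    seen = [False] * len(d)
    out = []
    ok = True
    while True:
        i = trail_q.get()       # blocks until the next trail item exists
        if i is END:
            break
        valid = (
            ok
            and isinstance(i, int)
            and 0 <= i < len(d)
            and not seen[i]                      # each index used once
            and (not out or out[-1] <= d[i])     # output stays sorted
        )
        if valid:
            seen[i] = True
            out.append(d[i])
        else:
            ok = False          # keep draining the queue, then report error
    result["second"] = out if ok and len(out) == len(d) else "error"


def run_concurrently(d):
    trail_q = queue.Queue()
    result = {}
    threads = [
        threading.Thread(target=f, args=(d, trail_q, result))
        for f in (first_execution, second_execution)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    if result["second"] == "error" or result["first"] != result["second"]:
        return "error"
    return result["first"]
```

The second thread lags the first by at most the items still waiting
in the queue, which mirrors the slight time lag described above.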

An additional opportunity for overlapping execution
occurs when the system has a dedicated comparator.
In this case it is sometimes possible for the two
executions to send their output to the comparator as
they generate it. For example, this can be done in the
minimum spanning tree problem where the edges of
the tree can be sent individually as they are discovered
by both executions.


4 Comparison of Techniques 

The certification trail approach, whether implemented 
in hardware or software or some combination thereof, 
has resemblances with other fault tolerant techniques 
that have been previously proposed and examined, but 
in each case there are significant and fundamental distinctions.
These distinctions are primarily related to
the generation and character of the certification trail 
and the manner in which the secondary algorithm or 
system uses the certification trail to indicate whether 
the execution of the primary system or algorithm was 
in error and/or to produce an output to be compared 
with that of the primary system. 

To begin, we compare the certification trail approach
to N-version programming [8, 4]. This approach
specifies that N different implementations of an algorithm
be independently executed with subsequent
comparison of the resulting N outputs. There is no
relationship among the executions of the different versions
of the algorithms other than they all use the
same input; each algorithm is executed independently
without any information about the execution of the
other algorithms. In marked contrast, the certification
trail approach allows the primary system to generate a
trail of information while executing its algorithm that
is critical to the secondary system's execution of its
algorithm. In effect, N-version programming can be
thought of relative to the certification trail approach
as the employment of a null trail.

A software/hardware fault tolerance technique called
the recovery block approach [20, 2, 17] uses acceptance
tests and alternative procedures to produce what is to
be regarded as a correct output from a program. When
using recovery blocks, a program is viewed as being
structured into blocks of operations which after execution
yield outputs which can be tested in some informal
sense for correctness. The rigor, completeness,
and nature of the acceptance test is left to the program
designer [2]. Indeed, formal methodologies for the definition
and generation of acceptance tests have thus
far not been fully established. Regardless, the certification
trail notion of a secondary system that receives
the same input as the primary system and executes
an algorithm that takes advantage of this trail to efficiently
produce the correct output and/or to indicate
that the execution of the first algorithm was correct
does not fall into the category of an acceptance test.

Recently Blum and Kannan [7] have defined what
they call a program checker. A program checker is
an algorithm which checks the output of another algorithm
for correctness and thus it is similar to an acceptance
test in a recovery block. An example of a program
checker is the algorithm developed by Tarjan [25]
which takes as input a graph and a supposed minimum
spanning tree and indicates whether or not the
tree actually is a minimum spanning tree. The Blum
and Kannan checker is actually more general than this
because it is allowed to be probabilistic in a carefully
specified way. There are two main differences
between this approach and the certification trail approach.
First, a program checker may call the algorithm
it is checking a polynomial number of times. In
our approach the algorithm being checked is run once.
Second, the checker is designed to work for a problem
and not a specific algorithm. That is, the checker
design is based on the input/output specification of a
problem. The certification trail approach is explicitly
algorithm oriented. In other words, a specific algorithm
for a problem is modified to output a certification
trail. This trail sometimes allows the second
execution to be faster than any known program checkers
for the problem. This is the case for the minimum
spanning tree problem.

Space limitations preclude comparisons with the
following other relevant techniques: watchdog processors
[18, 6], algorithm based fault tolerance [13], and
executable assertions [3].

5 Concluding Discussion 

We have presented a new, powerful fault tolerant computing
technique called the certification trail approach.
Our description of this technique has been only in
terms of applications to software fault tolerance, but
the certification trail approach can also be implemented
with hardware. We have illustrated the certification
trail technique by applying it to a minimum spanning
tree algorithm. The full version of this paper includes
applications to a Huffman tree algorithm and a convex
hull algorithm. It should be understood that the
approach is in no way limited to these algorithms. We
believe that our consideration of these algorithms gives
insight into the significance and desirability of the approach.
We have found several other algorithms to
which our techniques apply, including an algorithm for
the shortest path problem, and we believe the technique
will be widely applicable. We have also examined the
general problem of “certifying” data structure operations
as discussed above and have proven results for
additional data structures. These results are important
because they allow the certification trail approach
to be applied to any algorithm which uses one of these
data structures.

In the problem discussed an asymptotic speed up
was achieved between the first execution and the second
execution which was greater than any constant
factor. We note, however, that even if the speed up were
only by a constant factor, it would still make sense
to use the technique because execution time would be
saved. We also note that the certification trail technique
can be used in conjunction with other software
fault tolerance techniques. For example, multiple algorithms
can be developed which generate and read
multiple (but different) certification trails. Further,
these algorithms could be written by separate teams of
individuals. A general architecture for the interaction
of these algorithms is an important research topic. For
example, a “cascade” of algorithms numbered from 1
to N could be designed such that algorithm i sends
a certification trail to i + 1 which allows i + 1 to run
faster than i. When errors are detected, other versions
of algorithms can be invoked which may use an
earlier certification trail or ignore it. The ideas developed
in recovery blocks and N-version programming
among others could be used as guidance in exploring
such issues.

References 

[1] Adel'son-Vel'skii, G. M., and Landis, E. M., “An
algorithm for the organization of information”,
Soviet Math. Dokl., pp. 1259-1262, 3, 1962.

[2] Anderson, T., and Lee, P., Fault tolerance: principles
and practices, Prentice-Hall, Englewood
Cliffs, NJ, 1981.

[3] Andrews, D., “Using executable assertions for testing
and fault tolerance,” Dig. 9th Annu. Int.
Symp. Fault Tolerant Comput., pp. 102-105, 1979,
June 20-22.

[4] Avizienis, A., “The N-version approach to fault
tolerant software,” IEEE Trans. Software Engineering,
vol. 11, pp. 1491-1501, Dec., 1985.

[5] Bayer, R., and McCreight, E., “Organization of
large ordered indexes”, Acta Inform., pp. 173-189,
1, 1972.





[6] Blough, D., and Masson, G., “Performance analysis
of a generalized concurrent error detection
procedure,” IEEE Trans. on Computers, vol. 39,
Jan., 1990.

[7] Blum, M., and Kannan, S., “Designing programs
that check their work”, Proceedings of the 1989
ACM Symposium on Theory of Computing, pp.
86-97, ACM Press, 1989.

[8] Chen, L., and Avizienis A., “N-version programming:
a fault tolerant approach to reliability of
software operation,” Digest of the 1978 Fault Tolerant
Computing Symposium, pp. 3-9, IEEE Computer
Society Press, 1978.

[9] Dijkstra, E. W., “A note on two problems in connexion
with graphs,” Numer. Math. 1, pp. 269-271,
Sept., 1959.

[10] Gabow, H. N., Galil, Z., Spencer, T., and Tarjan,
R. E., “Efficient algorithms for finding minimum
spanning trees in undirected and directed
graphs,” Combinatorica 6, pp. 109-122, 2, 1986.

[11] Graham, R. L., and Hell, P., “On the history of
the minimum spanning tree problem,” Ann. Hist.
Comput., pp. 43-47, Jan., 1985.

[12] Guibas, L. J., and Sedgewick, R., “A dichromatic
framework for balanced trees”, Proceedings of the
Nineteenth Annual Symposium on Foundations
of Computing, pp. 8-21, IEEE Computer Society
Press, 1978.

[13] Huang, K.-H., and Abraham, J., “Algorithm-based
fault tolerance for matrix operations,”
IEEE Trans. on Computers, pp. 518-529, vol. C-33,
June, 1984.

[14] Johnson, B., Design and analysis of fault tolerant
digital systems, Addison-Wesley, Reading,
MA, 1989.

[15] Kane, J.R. and Yau, S.S., “Concurrent software
fault detection,” IEEE Trans. Software Eng., vol.
SE-1, pp. 87-99, March 1975.

[16] Komlós, J., “Linear verification for spanning
trees”, Proceedings of the 1984 Symposium on
Foundations of Computing, pp. 201-206, IEEE
Computer Society Press, 1984.


[17] Lee, Y.H. and Shin, K.G., “Design and evaluation
of a fault-tolerant multiprocessor using hardware
recovery blocks,” IEEE Trans. Comput., vol. C-33,
pp. 113-124, Feb. 1984.

[18] Mahmood, A., and McCluskey, E., “Concurrent
error detection using watchdog processors - a survey,”
IEEE Trans. on Computers, vol. 37, pp.
160-174, Feb., 1988.

[19] Prim, R. C., “Shortest connection networks and
some generalizations,” Bell Syst. Tech. J., pp.
1389-1401, Nov., 1957.

[20] Randell, B., “System structure for software fault
tolerance,” IEEE Trans. on Software Engineering,
vol. 1, pp. 220-232, June, 1975.

[21] Shen, J.P. and Schuette, M.A., “On-line self-monitoring
using signatured instruction streams,”
Proc. 1983 Int. Test Conf., pp. 275-282, Oct.,
1983.

[22] Siewiorek, D., and Swarz, R., The theory and
practice of reliable design, Digital Press, Bedford,
MA, 1982.

[23] Sridhar, T. and Thatte, S.M., “Concurrent checking
of program flow in VLSI processors,” Dig.
1982 Int. Test Conf., pp. 191-199, Nov., 1982.

[24] Tarjan, R. E., Data Structures and Network Algorithms,
Society for Industrial and Applied Mathematics,
Philadelphia, PA, 1983.

[25] Tarjan, R. E., “Applications of path compression
on balanced trees”, J. ACM, pp. 690-715, Oct.,
1979.

[26] Tomas, S. P. and Shen, J. P., “A roving monitoring
processor for detection of control flow errors
in multiple processor systems,” Proc. IEEE Int.
Conf. Comput. Design: VLSI Comput., pp. 531-539,
Oct., 1985.

[27] Yau, S.S. and Chen, F.-C., “An approach to concurrent
control flow checking,” IEEE Trans. Software
Eng., vol. SE-6, pp. 126-137, March 1980.







Figure 1: Certification trail method.


output such that $(d,s) \in P$.

Definition 2.2 Let $P : D \to S$ be a problem. A solution
to this problem using a certification trail consists
of two functions $F_1$ and $F_2$ with the following domains
and ranges: $F_1 : D \to S \times T$ and
$F_2 : D \times T \to S \cup \{\text{error}\}$. $T$ is the set of
certification trails. The functions must satisfy the
following two properties:

(1) for all $d \in D$ there exists $s \in S$ and
there exists $t \in T$ such that
$F_1(d) = (s,t)$ and $F_2(d,t) = s$ and $(d,s) \in P$

(2) for all $d \in D$ and for all $t \in T$
either ($F_2(d,t) = s$ and $(d,s) \in P$) or
$F_2(d,t) = \text{error}$.


We also require that $F_1$ and $F_2$ be implemented
so that they map elements which are not in their respective
domains to the error symbol. The definitions
above assure that the error-detection capability of the
certification-trail approach is similar to that obtained
with the simple time-redundancy approach discussed
earlier. (That is, if transient hardware faults occur
during only one of the executions then either an error
will be detected or the output will be correct.) It
should be further noted, however, that the examples to be
considered will indicate that this new approach can
save overall execution time.
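
A small instance may make the two properties of the definition
concrete. The sketch below is an illustration chosen for brevity and
is an assumption of this example rather than one of the problems
treated in this paper: the problem is sorting, the trail is the
permutation of input positions that sorts the input, and the names
f1, f2 and ERROR are hypothetical. The second function never trusts
the trail; it accepts only a trail that is a genuine permutation and
that yields a nondecreasing output, so for any trail it returns
either the sorted input or the error symbol.

```python
ERROR = "error"


def f1(d):
    """First execution: solve the problem and leave a trail behind."""
    trail = sorted(range(len(d)), key=lambda i: d[i])   # sorting permutation
    return [d[i] for i in trail], trail


def f2(d, trail):
    """Second execution: linear-time pass that uses but checks the trail."""
    n = len(d)
    if len(trail) != n:
        return ERROR
    seen = [False] * n
    for i in trail:
        # every position must be a valid index and appear exactly once
        if not isinstance(i, int) or not 0 <= i < n or seen[i]:
            return ERROR
        seen[i] = True
    s = [d[i] for i in trail]
    # the rearranged input must be in nondecreasing order
    if any(s[k] > s[k + 1] for k in range(n - 1)):
        return ERROR
    return s
```

Property (1) holds because the trail produced by f1 always passes
both checks in f2. Property (2) holds because any trail that passes
both checks rearranges the whole input into sorted order, which is
the unique correct output.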


Observant readers of our earlier paper [11] in which
we introduced the notion of a certification trail might
have noticed that our certification-trail solution for the
minimum spanning tree was generalizable. The generalized
technique allows one to generate a certification trail
for many algorithms which use a balanced binary tree
data structure. However, the technique relies on the
efficient execution of the predecessor operation and
some data structures such as heaps cannot execute
this operation efficiently. The techniques
described in this paper are even more general and powerful
and they do apply to heaps.

The degree of diversity or independence achieved
when using certification trails depends on how they
are used. A fuller discussion of this and of the relationship
between certification trails and other approaches
to software fault tolerance is contained in the
expanded version of [11]. This current paper presents
asymptotic analysis which shows that the certification-trail
approach is desirable even when the overhead of
generating the certification trail is included. We are
currently working on an experimental analysis of the
method and initial results are quite promising.


3 Answer-Validation Problem for
Abstract Data Types

Our general approach to applying certification trails
uses the concept of an abstract data type. Some examples
of abstract data types are given later in this paper.
Here we mention some important common properties
and give a short illustration. Each abstract data type
has a well defined data object or set of data objects,
and each abstract data type has a carefully defined finite
collection of operations that can be performed on
its data object(s). Each operation takes a finite number
of arguments (possibly zero), and some but not
all operations return answers. An example of an abstract
data type is a priority queue. The data object
for a priority queue is an ordered pair of the form (i,k)
where i is an item number and k is a key value. A priority
queue has two operations: insert(i,k) and delmin.
The insert operation has two arguments: item number
i and key value k. The insert operation does not return
an answer. The delmin operation has no arguments,
but it does return an answer. The precise semantics
of these operations are given later in this paper.

For each abstract data type we define an answer-validation
problem. Intuitively, the answer-validation
problem consists of checking the correctness of a sequence
of supposed answers to a sequence of operations
performed on the abstract data type. More formally,
the input to the answer-validation problem is
a sequence of operations on the abstract data type
together with the arguments of each operation. In addition,
the sequence contains the supposed answers for
each of the operations which return answers. In particular,
each supposed answer is paired with the operation
that is supposed to return it. Examples of such
inputs are given in the columns labelled “Operation”
and “Answer” of table 1 and table 2.

The output for the answer-validation problem is
the word “correct” if the answers given in the input
match the answers that would be generated by actually
performing the operations. The output is the word
“incorrect” if the answers do not match. It is also
useful to allow the output word to say “ill-formed”.
This output is used if the sequence of operations is
ill-formed, e.g., an operation has too many arguments or
an argument refers to an inappropriate object.





The Twenty-First International Symposium on Fault-Tolerant Computing (1991)


Certification Trails for Data Structures 

Gregory F. Sullivan* 

Gerald M. Masson^ 

Dept. of Computer Science, Johns Hopkins Univ., Baltimore, MD 21218


Abstract 


Certification trails are a recently introduced and promising
approach to fault detection and fault tolerance [11].
In this paper, we significantly generalize the applicability
of the certification trail technique. Previously,
certification trails had to be customized to each algorithm
application, but here we develop trails appropriate
to wide classes of algorithms. These certification
trails are based on common data-structure operations
such as those carried out using balanced binary
trees and heaps. Any algorithm using these sets of
operations can therefore employ the certification trail
method to achieve software fault tolerance. To exemplify
the scope of the generalization of the certification
trail technique provided in this paper, constructions of
trails for abstract data types such as priority queues
and union-find structures will be given. These trails
are applicable to any data-structure implementation of
the abstract data type. It will also be shown that these
ideas lead naturally to monitors for data-structure operations.

Keywords: Software fault tolerance, error monitor- 
ing, certification trails, design diversity, data struc- 
tures. 


1 Introduction 

In this paper we significantly generalize the novel and
powerful certification-trail technique for achieving fault
tolerance in systems that was introduced in [11]. Although
applicable to both hardware and software, we
restrict our discussion of the certification-trail technique
in the following to software fault tolerance. To
explain the essence of the certification-trail technique
for software fault tolerance, we will first discuss a simpler
fault-tolerant software method. In this method
the specification of a problem is given and an algorithm
to solve it is constructed. This algorithm is executed
on an input and the output is stored. Next,
the same algorithm is executed again on the same input
and the output is compared to the earlier output.
If the outputs differ then an error is indicated; otherwise
the output is accepted as correct. This software
fault tolerance method requires additional time, so-called
time redundancy [8, 10]; however, it requires no
additional software. It is particularly valuable for detecting
errors caused by transient fault phenomena. If
such faults cause an error during only one of the executions
then either the error will be detected or the
output will be correct. The second possibility, of undetected
faults, occurs when the output of the execution
is unaffected by the faults.

* Research partially supported by NSF Grants CCR-8910569
and CCR-8908092.

^ Research partially supported by NASA Grant NSG 1442.

The certification-trail technique is designed to obtain
similar types of error-detection capabilities but
expend fewer resources. The central idea, as illustrated
in Figure 1, is to modify the first algorithm
so that it leaves behind a trail of data which we call a
certification trail. This data is chosen so that it can allow
the second algorithm to execute more quickly
and/or have a simpler structure than the first algorithm.
As above, the outputs of the two executions
are compared and are considered correct only if they
agree. Note, however, we must be careful in defining
this method or else its error detection capability might
be reduced by the introduction of data dependency
between the two algorithm executions. For example,
suppose the first algorithm execution contains an error
which causes an incorrect output and an incorrect trail
of data to be generated. Further suppose that no error
occurs during the execution of the second algorithm. It
still appears possible that the execution of the second
algorithm might use the incorrect trail to generate an
incorrect output which matches the incorrect output
given by the execution of the first algorithm. Intuitively,
the second execution would be “fooled” by the
data left behind by the first execution. The definitions
we give below exclude this possibility. They demand
that the second execution either generate a correct answer
or signal that an error has been detected in the
data trail.




2 Formal Definition of a Certification Trail


In this section we will give a formal definition of a 
certification trail and discuss some aspects of its real- 
izations and uses. 

Definition 2.1 A problem P is formalized as a relation,
i.e., a set of ordered pairs. Let D be the domain
(that is, the set of inputs) of the relation P and let
S be the range (that is, the set of solutions) for the
problem. We say an algorithm A solves a problem P
iff for all $d \in D$ when $d$ is input to A then an $s \in S$ is


CH2985-0/91/0000/0240/$01.00 © 1991 IEEE 



The answer-validation problem is similar to the
idea of an acceptance test which is used in the recovery-block
approach [9, 2] to software fault tolerance. The
main difference is that an answer-validation problem is
dependent upon a sequence of answers, not just an individual
answer. Hence, if an incorrect answer appears
in the sequence, it may not be detected immediately.
It is guaranteed, however, that an incorrect answer
will be detected at some point during the processing
of the entire sequence. By allowing for this latency in
detection, it is possible to create a much more efficient
procedure for solving the answer-validation problem.

In this paper we shall solve the validation problem
for two abstract data types. In the full version of this
paper we solve the answer-validation problem for more
general data types [12].

The most important aspect of the answer-validation
problem is that it is often possible to check the correctness
of the answers to a sequence of operations
much more quickly than actually calculating what the
answers should be from scratch. In other words, the
answer-validation problem has a smaller time complexity
than the original abstract-data-type problem. For
example, to calculate the answers to a sequence of n
priority-queue operations takes Ω(n log(n)) time; however,
it is possible to check the correctness of the answers
in only O(n) time. This speedup is very useful
in fault-detection applications.

It is possible to run an answer-validation algorithm
for some abstract data type concurrently with some
algorithm which uses the abstract data type. The
answer-validation algorithm could act as a monitor
making sure that all interactions with the abstract
data type are handled correctly. This is valuable because
many algorithms spend a large fraction of their
time operating on abstract data types. Note, the overhead
of this monitor is less than the overhead of actually
performing the data-type operations a second
time.

One possible application of the answer-validation
problem occurs when it is used in conjunction with a
repairable data structure which allows for repair but
does not automatically attempt to detect faults [16].
Suppose an abstract data type is implemented with
a repairable data structure. One can use an answer-validation
procedure to detect errors in the answers
generated by the abstract data type. When an error
is detected, a repair of the data structure can be
attempted. In some cases, recovery and continued execution
will be possible.

In the next section, we will show how to create cer- 
tification trails for programs which use abstract data 
types when those data types have efficient solutions 
for their answer-validation problems. 


4 Schema for using Certification 
Trails 

Suppose that we have developed an efficient solution to
the answer-validation problem for some abstract data
type. By efficient we mean the time complexity of
the answer-validation problem is smaller than the time
complexity of the original abstract-data-type problem.
Further, suppose that we wish to run an algorithm,
say A, which uses that abstract data type. To apply
the certification trail method we can use the following
schema to yield the two executions:

First Execution:

Execute algorithm A.

Each time an abstract-data-type operation is performed,
append to the certification trail the identity of the operation,
the arguments and the answer.

Second Execution:

Phase One:

Validate the correctness of the operations and supposed
answers given in the certification trail. If the
validation returns “incorrect” or “ill-formed” then output
“error” and stop. Otherwise, continue.

Phase Two:

Execute algorithm A.

Each time an abstract-data-type operation is performed,
read the next entry in the certification trail. Make sure
that the operation and the arguments in the certification
trail agree with those requested in the algorithm.
If not, output “error” and stop. Otherwise, use the
answer given in the certification trail and continue.

In the final step the outputs from the two executions
are compared and the output is accepted or an error
is signaled. This schema can yield execution times
which are significantly faster than the execution time
obtained by running algorithm A twice, yet these two
methods give similar fault detection capabilities. That
is, if transient hardware faults occur during only one
of the executions then either an error will be detected
or the output will be correct. Note, the first execution
can be slower than a simple execution of algorithm
A since it must output a certification trail. However,
the second execution can be significantly faster than
a simple execution of the algorithm since the interactions
with the abstract data type take less time overall.
The net effect can be a major speedup.

Suppose an algorithm uses multiple abstract data
types and suppose there are efficient answer-validation
algorithms for each of these abstract data types. It is
easy to see how our method generalizes. We can leave
behind a generalized certification trail which consists
of a separate certification trail for each of the abstract
data types. The effect on the speedup of the second
execution will be cumulative.
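
The schema of this section can be sketched for a single abstract
data type as follows. This is a minimal sketch under stated
assumptions, not the implementation analyzed in this paper: the
abstract data type is a priority queue, algorithm A is a toy
algorithm that sorts by inserting every value and then calling
delmin repeatedly, and all names (RecordingPQ, ReplayPQ,
naive_validate, and so on) are hypothetical. In particular,
naive_validate simply replays the operations on a real heap; it
stands in for an efficient answer-validation procedure and has none
of its speed advantage.

```python
import heapq


class TrailError(Exception):
    """Raised where the second execution would output "error" and stop."""


class RecordingPQ:
    """First execution: a real priority queue that also writes the trail."""

    def __init__(self):
        self._heap = []
        self.trail = []

    def insert(self, item, key):
        heapq.heappush(self._heap, (key, item))
        self.trail.append(("insert", (item, key), None))

    def delmin(self):
        key, item = heapq.heappop(self._heap)
        self.trail.append(("delmin", (), (item, key)))
        return (item, key)


class ReplayPQ:
    """Second execution, phase two: answers are read from the trail."""

    def __init__(self, trail):
        self._entries = iter(trail)

    def _next(self, op, args):
        entry = next(self._entries, None)
        if entry is None or entry[0] != op or entry[1] != args:
            raise TrailError("trail disagrees with the requested operation")
        return entry[2]

    def insert(self, item, key):
        self._next("insert", (item, key))

    def delmin(self):
        return self._next("delmin", ())


def naive_validate(trail):
    """Placeholder validator: replays the trail on a real heap."""
    heap = []
    for entry in trail:
        if not (isinstance(entry, tuple) and len(entry) == 3):
            return "ill-formed"
        op, args, answer = entry
        if op == "insert" and isinstance(args, tuple) and len(args) == 2:
            heapq.heappush(heap, (args[1], args[0]))
        elif op == "delmin" and args == () and heap:
            key, item = heapq.heappop(heap)
            if answer != (item, key):
                return "incorrect"
        else:
            return "ill-formed"
    return "correct"


def algorithm_a(values, pq):
    """Toy algorithm A: sort the values using only insert and delmin."""
    for item, key in enumerate(values):
        pq.insert(item, key)
    return [pq.delmin()[1] for _ in values]


def first_execution(values):
    pq = RecordingPQ()
    return algorithm_a(values, pq), pq.trail


def second_execution(values, trail, validate=naive_validate):
    if validate(trail) != "correct":          # phase one
        return "error"
    try:
        return algorithm_a(values, ReplayPQ(trail))   # phase two
    except TrailError:
        return "error"
```

The outputs of first_execution and second_execution would then be
compared in the final step. Phase two performs no heap operations at
all, so with an efficient validator in place of naive_validate the
cost of the second execution is dominated by the validation pass.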





Figure 2: Union Tree and with Find Edges 








5 Answer Validation for Disjoint 
Set Union 

As our first example we will discuss the disjoint-set
union problem. This problem concerns a dynamic collection
of sets in which pairs of sets can be combined
to yield new sets. The underlying universe of set elements
consists of the integers from 1 to n inclusive.
Also, the universe of set names consists of the integers
from 1 to n inclusive. There are three operations that
can be performed:


create(A,x) creates a singleton set named A which
contains element x. Since sets must be disjoint we
require that x not already be in some set.

union(A,B) creates a new set which is the union
of the sets named A and B. This new set is called A
and the set named B becomes undefined. It is required
that the sets named A and B are originally defined and
are disjoint.

find(x) returns the name of the set which contains
element x. It is required that x be a member of some
unique set.

If an operation violates one of the requirements
described above then it is considered to be ill-formed.
Also, if an operation has the wrong number or type of
arguments it is considered to be ill-formed.

In table 1 we give an example of a sequence of
disjoint-set-union operations together with the answers
for find operations. In addition, the collection of sets
is depicted as it is changed by the operations. For simplicity,
in this example each set name corresponds to
the integer originally contained in the set when it is
created. Sets are listed by first giving the name of the
set followed by a colon and then the contents of the
set.

The disjoint-set-union problem is a classic problem
which has many applications [4] such as the off-line


Operation     Answer   Status of sets

create(1,1)            1:{1}
create(2,2)            1:{1},2:{2}
union(1,2)             1:{1,2}
find(2)       1        1:{1,2}
create(3,3)            1:{1,2},3:{3}
create(4,4)            1:{1,2},3:{3},4:{4}
create(5,5)            1:{1,2},3:{3},4:{4},5:{5}
union(5,3)             1:{1,2},4:{4},5:{3,5}
union(5,1)             4:{4},5:{1,2,3,5}
find(2)       5        4:{4},5:{1,2,3,5}
find(5)       5        4:{4},5:{1,2,3,5}
create(6,6)            4:{4},5:{1,2,3,5},6:{6}
union(4,6)             4:{4,6},5:{1,2,3,5}
create(7,7)            4:{4,6},5:{1,2,3,5},7:{7}
union(4,7)             4:{4,6,7},5:{1,2,3,5}
find(6)       4        4:{4,6,7},5:{1,2,3,5}


Table 1: Sequence of operations for a Disjoint Set 
Union 


min problem, connected components, least-common 
ancestors, and equivalence of finite automata. Of par- 
ticular interest is the time-complexity of performing a 
sequence of operations. Let us say the total number of 
operations is m, which is assumed to be greater than 
or equal to n. Recall, n is the number of set elements 
and set names. 

Tarjan gave the tight upper bound of O(m α(m, n))
[13, 14] for this problem. The α refers to the inverse
of Ackermann's function which is a very slowly growing
function. His solution and earlier solutions used
a path-compression heuristic [15]. Fredman and Saks
gave a lower bound of Ω(m α(m, n)) [5] in a general
cell-probe model. Gabow and Tarjan show how to
solve some important special cases of this problem in
O(m) time [6].
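
For orientation, a data structure of the kind whose answers are being
validated can be sketched as follows. This is a textbook-style union-find
with path compression and union by size, not code from this work; the class
name DisjointSets and its internal dictionaries are illustrative
assumptions, and ill-formed operations are reported here by raising
ValueError.

```python
class DisjointSets:
    """Named disjoint sets supporting create, union and find."""

    def __init__(self):
        self.parent = {}        # element -> parent element
        self.size = {}          # root element -> number of elements
        self.name_of_root = {}  # root element -> set name
        self.root_of_name = {}  # set name -> root element

    def create(self, name, x):
        if name in self.root_of_name or x in self.parent:
            raise ValueError("ill-formed")
        self.parent[x] = x
        self.size[x] = 1
        self.name_of_root[x] = name
        self.root_of_name[name] = x

    def _root(self, x):
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every node on the path at the root.
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a, b):
        if a == b or a not in self.root_of_name or b not in self.root_of_name:
            raise ValueError("ill-formed")
        ra = self.root_of_name.pop(a)
        rb = self.root_of_name.pop(b)
        del self.name_of_root[ra]
        del self.name_of_root[rb]
        # Union by size: hang the smaller tree under the larger one.
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]
        # The combined set keeps the name a; the name b becomes undefined.
        self.name_of_root[ra] = a
        self.root_of_name[a] = ra

    def find(self, x):
        if x not in self.parent:
            raise ValueError("ill-formed")
        return self.name_of_root[self._root(x)]
```

Running the operations of table 1 through such a structure yields the find
answers that an answer-validation algorithm would then be asked to check.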

We now consider the answer-validation problem for
the disjoint-set-union data type. We will show that
this problem can be solved in O(m) time where m
is the number of operations. Note, this time complexity
is superior to the complexity of actually performing
the sequence of operations as discussed above.
One method for solving this problem in O(m) time
uses the powerful techniques of Gabow and Tarjan [6].
However, we shall present a simpler method with a
small constant of proportionality that is tailored to
this problem.

To solve this problem we will build a forest based
on the union operations in the sequence. In addition,
we shall add edges to this forest based on the find
operations. As a final step we will perform a traversal
of the forest and perform appropriate checks. The solid
edges in figure 2 indicate the forest we would build for
the set of operations given in table 1. The dashed
edges indicate the edges we would add to the forest
based on the find operations.

Algorithm for Answer Validation for Disjoint- 
Set Union 

Input: sequence of m operations together with arguments
and supposed answers for the disjoint-set union
data type.

Output: “correct”, “incorrect” or “ill-formed”

Declarations: Type treenode has fields left and right.
Type treeleaf contains a list of pointers such that each
pointer points to a treenode or a treeleaf. Array activeset
is indexed by set name. Each array element is
a pointer to a treenode or a treeleaf. Array whereis is
indexed by an element number. Each array element
is a pointer to a treeleaf. Initially, all pointers are nil
and lists are null.

In the first phase of the algorithm we process each op- 
eration as it appears serially using the following rules: 

create(A,x): If activeset[A] or whereis[x] is non-nil
then output “ill-formed” and stop. Otherwise, allocate
a treeleaf and set activeset[A] and whereis[x] to the
allocated node.

union(A,B): If activeset[A] or activeset[B] is nil then
output “ill-formed” and stop. Otherwise, allocate a
treenode and set left to activeset[A] and right to
activeset[B]. Next set activeset[A] to the treenode and
set activeset[B] to nil.

find(x) A: (where A is the supposed answer to the
find.) If whereis[x] is nil then output “ill-formed” and stop.
Otherwise, whereis[x] points to some treeleaf. Call it
tleaf. If activeset[A] is nil then output “ill-formed” and stop.
Otherwise, activeset[A] points to some treeleaf or treenode.
Call it t. Add a pointer to t to the list of pointers
contained in tleaf.

In the second phase of the algorithm we shall traverse 
the structure we have built. 

Scan through the array activeset to find non-nil pointers.
It is not hard to see that each non-nil pointer points
to the root of a tree made up of nodes of type treenode
and treeleaf. The tree uses the edges in the left and right
fields of each treenode.

For each such tree perform a depth-first search. Whenever
the search reaches a node of type treeleaf traverse
the list of pointers that it contains. Check that each
pointer points to a node which is currently on the stack
which is used to perform the depth-first search. This is
equivalent to checking that each pointer in the treeleaf points
to a node which is an ancestor of that treeleaf in the tree,
where a node counts as its own ancestor.

If some pointer does not point to an ancestor then output
“incorrect” and stop. Otherwise, once every tree has
been searched, output “correct” and stop.


Theorem 5.1 The algorithm for answer validation of
the disjoint-set-union abstract data type is correct.

Theorem 5.2 The answer validation algorithm for disjoint
set union has a time complexity of O(m) for processing
a sequence of m operations.

We omit the proofs of these two theorems, which overall
are not difficult to show. We comment on one aspect of
implementation. In the second phase of the answer validation
algorithm it is necessary to determine if certain
nodes are on the stack during the tree traversal. This
can be done efficiently as follows: First, each treenode
and each treeleaf can be assigned a unique identifier
in the range 1 to m as it is allocated. Next, a
boolean vector of size m indexed by the unique identifiers
described above can be allocated. This vector
can be used to keep track of which nodes are on the
stack during tree traversal by turning bits on and off.
This modified tree traversal algorithm still takes O(m)
time.
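
The two phases described above can be sketched as follows. This is a
hedged illustration rather than the authors' code: the operation encoding
("create", A, x), ("union", A, B), ("find", x, A), the class names Leaf and
Node, and the function name validate_disjoint_set_union are assumptions made
for the example. Dictionaries stand in for the arrays activeset and whereis,
and union(A,A) is treated as ill-formed, which is one reading of the
disjointness requirement.

```python
class Leaf:
    def __init__(self, ident):
        self.ident = ident
        self.claims = []  # nodes that find answers claim as ancestors


class Node:
    def __init__(self, ident, left, right):
        self.ident = ident
        self.left = left
        self.right = right


def validate_disjoint_set_union(ops):
    activeset = {}  # set name -> current root (absent means nil)
    whereis = {}    # element -> its Leaf (absent means nil)
    next_id = 0

    # Phase one: build the forest and attach the find edges.
    for op in ops:
        if len(op) != 3:
            return "ill-formed"
        kind = op[0]
        if kind == "create":
            _, name, x = op
            if name in activeset or x in whereis:
                return "ill-formed"
            leaf = Leaf(next_id)
            next_id += 1
            activeset[name] = whereis[x] = leaf
        elif kind == "union":
            _, a, b = op
            if a == b or a not in activeset or b not in activeset:
                return "ill-formed"
            node = Node(next_id, activeset[a], activeset.pop(b))
            next_id += 1
            activeset[a] = node
        elif kind == "find":
            _, x, answer = op
            if x not in whereis or answer not in activeset:
                return "ill-formed"
            whereis[x].claims.append(activeset[answer])
        else:
            return "ill-formed"

    # Phase two: depth-first search with a boolean "on stack" vector.
    on_stack = [False] * next_id
    for root in activeset.values():
        stack = [(root, False)]
        while stack:
            node, leaving = stack.pop()
            if leaving:
                on_stack[node.ident] = False
                continue
            on_stack[node.ident] = True
            stack.append((node, True))  # marker to clear the bit later
            if isinstance(node, Leaf):
                for target in node.claims:
                    if not on_stack[target.ident]:
                        return "incorrect"
            else:
                stack.append((node.right, False))
                stack.append((node.left, False))
    return "correct"
```

Each node is pushed and popped a constant number of times and each find
edge is examined once, which matches the linear bound stated above.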

6 Generalized Priority Queue 

We now describe a somewhat general abstract data
type. We will solve the answer validation problem for
restricted versions of this data type. The data consists
of a set of ordered pairs. The first element in these ordered
pairs is referred to as the item number and the
second element is called the key value. Ordered pairs
may be added and removed from the set, however, at
all times the item numbers of distinct ordered pairs
must be distinct. It is possible, though, for multiple
ordered pairs to have the same key value. In this paper
the item numbers are integers between 1 and n,
inclusive. Our default convention is that i is an item
number, k is a key value and h is a set of ordered pairs.
A total ordering on the pairs of a set can be defined
lexicographically as follows: (i,k) < (i',k') iff k < k'
or (k = k' and i < i'). The abstract data types we will
consider support a subset of the following operations.

member(i) returns a boolean value of true if the set
contains an ordered pair with item number i,
otherwise returns false.

insert(i,k) adds the ordered pair (i,k) to the set. We
require that no other pair with item number i be
in the set.

delete(i) deletes the unique ordered pair with item
number i from the set. We require that a pair
with item number i be in the set initially.

changekey(i,k) is executed only when there is an ordered
pair with item number i in the set. This
pair is replaced by (i,k).


deletemin (or deletemax) returns the ordered pair which
is smallest (or largest) according to the total order
defined above and deletes this pair. If the
set is empty then the token “empty” is returned.

min (or max) returns the ordered pair which is smallest
(or largest) according to the total order defined
above. If the set is empty then the token
“empty” is returned.

If an operation violates one of the requirements described
above then it is considered to be ill-formed.
Also, if an operation has the wrong number or type of
arguments it is considered to be ill-formed.

Many different types and combinations of data structures
can be used to support different subsets of these
operations efficiently.

7 Answer Validation for Priority Queue

We will first consider the priority-queue abstract data
type which allows only two operations: insert and
deletemin. An example of a sequence of such operations
appears in table 2. Many different data structures
can be used to implement priority queues including
heaps [17], balanced search trees such as AVL trees
[1], red-black trees [7], or B-trees [3]. It is possible to
process a sequence of O(n) operations in O(n log(n))
time using the data structures above. Furthermore,
there is a lower bound of Ω(n log(n)) because it is possible
to sort using a priority queue. Remarkably, the
answer-validation problem can be solved using only
O(n) time, as documented below.


T   Operation       Answer     Validation stack

1   insert(6,300)
2   insert(2,404)
3   insert(3,250)
4   deletemin       (3,250)    (3,250,4)
5   insert(10,248)
6   insert(12,245)
7   insert(4,260)
8   deletemin       (12,245)   (12,245,8),(3,250,4)
9   insert(13,140)
10  insert(5,142)
11  deletemin       (13,140)   (13,140,11),(12,245,8),(3,250,4)
12  deletemin       (5,142)    (5,142,12),(12,245,8),(3,250,4)
13  deletemin       (10,248)   (10,248,13),(3,250,4)
14  deletemin       (4,260)    (4,260,14)


Table 2: Sequence of Priority Queue operations illustrating
answer validation algorithm


Each operation is time-stamped, i.e., the operations
are assigned integers sequentially starting with
1, which is easy to do with a counter. The answer-validation
algorithm uses a stack called deletestack.
The contents of this stack are illustrated in table 2.
The top of the stack is on the left in table 2.

Let us consider the kinds of tests that an answer-validation
algorithm for a priority queue might perform.
Suppose (i,k) is the answer to some deletemin
operation. Further, suppose (i',k') was deleted in a
previous deletemin operation. If the priority queue is
correct then either (i,k)>(i',k') or (i',k') was deleted
before (i,k) was inserted. This suggests that the time
of insertion and deletion for elements should be recorded
and the algorithm below does this. Unfortunately, if
an algorithm compares an ordered pair which has been
deleted against all previously deleted ordered pairs
then the algorithm complexity is Ω(m^2), where m is
the number of operations. To
avoid this the deletestack is used. The deletestack was
designed to allow many comparisons to be done implicitly
and to reduce the complexity.



Algorithm for Answer Validation for Priority 
Queue 

Input: sequence of O(n) operations together with arguments
and supposed answers for the priority-queue
data type.

Output: “correct”, “incorrect” or “ill-formed”

Declarations: Array called inserttime indexed by item
number. Array elements contain either “absent” or
a time-stamp. Array called keyvalue indexed by item
number. Array elements contain either “absent” or
a key value. Initially, each element in these two arrays
contains “absent”. Stack of ordered triples called
deletestack. Each ordered triple has the following form:
first element is an item number, second element is a
key value, and third element is a time-stamp. deletestack
is initially empty.

In the first phase of the algorithm we process each op- 
eration as it appears serially using the following rules: 

Let currenttime refer to the time-stamp of the opera- 
tion being processed. 

insert(i,k): If inserttime[i] ≠ “absent” then output
“ill-formed” and stop. Otherwise, let inserttime[i] =
currenttime and let keyvalue[i] = k.

deletemin (i,k): (where (i,k) is the supposed answer
to the deletemin operation.) If inserttime[i] = “absent”
or keyvalue[i] ≠ k then output “ill-formed” and stop.

Otherwise, let (i',k') be the item number and key
value of the triple on the top of deletestack (if there
is one). Repeatedly pop the stack until (i,k)<(i',k') or
until deletestack is empty.

If deletestack is empty then push the triple
(i,k,currenttime) onto deletestack. Further, let
inserttime[i] = “absent” and let keyvalue[i] = “absent” and
process the next priority queue operation.

If deletestack is non-empty then let the top element
be (i',k',deletetime'). If inserttime[i] < deletetime' then
output “incorrect” and stop. Otherwise, push the
triple (i,k,currenttime) onto deletestack. Next, let
inserttime[i] = “absent” and let keyvalue[i] = “absent” and
process the next priority queue operation.



In the second phase of the algorithm we operate 
on the items which have been inserted but have never 
been deleted. 


Scan the array inserttime and for each item number i
for which inserttime[i] ≠ “absent” construct an ordered
triple (i,keyvalue[i],inserttime[i]). Call this set of ordered
triples remainders.

Use a bucket sort to sort the triples in remainders by
their time-stamps, i.e., the third element of the ordered
triple.

Merge the triples in remainders together with the triples
in deletestack so that they are all ordered by their
time-stamps, i.e., the third element of the ordered
triple.

Scan the combined triples to determine if there exist
two triples which satisfy the following: inserttime[i] <
deletetime' and (i,keyvalue[i]) < (i',k'); where one triple
is from remainders and has the form (i,keyvalue[i],
inserttime[i]) and where the other triple is from deletestack
and has the form (i',k',deletetime').

If these two triples exist then output “incorrect” and
stop. Otherwise output “correct” and stop.
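
A compact sketch of both phases is given below. It is an illustration under
stated assumptions, not the authors' implementation: the operation encoding
("insert", i, k) and ("deletemin", i, k), the function name
validate_priority_queue and the helper less are hypothetical, dictionaries
replace the arrays (a missing key plays the role of “absent”), and the trail
is assumed to contain no “empty” answers. Because every time-stamp is
unique, a single array indexed by time-stamp is used here to perform the
bucket sort and the merge in one step, and the final scan runs from the
latest time-stamp to the earliest; that scan order is one way to realise the
test described above, not necessarily the one the authors used.

```python
def less(p, q):
    """(i,k) < (i',k') iff k < k' or (k = k' and i < i')."""
    return (p[1], p[0]) < (q[1], q[0])


def validate_priority_queue(ops):
    inserttime = {}   # item number -> time-stamp of its insertion
    keyvalue = {}     # item number -> key value
    deletestack = []  # triples (i, k, deletetime), top at the end

    # Phase one: process the operations in order.
    for currenttime, op in enumerate(ops, start=1):
        if len(op) != 3:
            return "ill-formed"
        kind, i, k = op
        if kind == "insert":
            if i in inserttime:
                return "ill-formed"
            inserttime[i] = currenttime
            keyvalue[i] = k
        elif kind == "deletemin":
            if i not in inserttime or keyvalue[i] != k:
                return "ill-formed"
            # Pop until (i,k) is smaller than the pair on top.
            while deletestack and not less((i, k), deletestack[-1][:2]):
                deletestack.pop()
            # A larger pair was deleted while (i,k) was already present.
            if deletestack and inserttime[i] < deletestack[-1][2]:
                return "incorrect"
            deletestack.append((i, k, currenttime))
            del inserttime[i]
            del keyvalue[i]
        else:
            return "ill-formed"

    # Phase two: items inserted but never deleted (the remainders).
    slot = [None] * (len(ops) + 1)  # one bucket per time-stamp
    for i, t in inserttime.items():
        slot[t] = ("remainder", (i, keyvalue[i]))
    for i, k, t in deletestack:
        slot[t] = ("deleted", (i, k))

    # Pairs on deletestack grow as time-stamps shrink, so the most
    # recently seen deleted pair is the largest one deleted later.
    later_deleted = None
    for t in range(len(ops), 0, -1):
        if slot[t] is None:
            continue
        kind, pair = slot[t]
        if kind == "deleted":
            later_deleted = pair
        elif later_deleted is not None and less(pair, later_deleted):
            return "incorrect"
    return "correct"
```

Applied to the sequence of table 2, the stack contents after each deletemin
in this sketch coincide with the “Validation stack” column of that table.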


Theorem 7.1 The algorithm for answer validation of
the priority queue abstract data type is correct.

Proof: Clearly the algorithm for answer validation
always terminates. We must show that the algorithm
outputs “correct” iff the operations together with arguments
and supposed answers are correct. Because of
space limitations we will only give a proof for the more
difficult half of this iff statement. We shall use a proof
by contradiction. Assume that the sequence of operations,
arguments and supposed answers is considered
correct by the algorithm but actually is incorrect. The
use of the array inserttime and the symbol “absent”
assures that no item is deleted when it is absent or inserted
when it is already present. The use of the array
keyvalue assures that items do not change key value
when they are present in the data type set. There is
only one remaining way in which a sequence can be
incorrect. This occurs when an ordered pair is deleted
by a deletemin operation, however, it does not really
have the smallest key value.

This means, there exist ordered pairs (i1,k1) and
(i2,k2) such that (i1,k1)>(i2,k2) and (i1,k1) is deleted
while (i2,k2) is present in the data type set. In addition,
we may specify that (i1,k1) is the largest ordered
pair deleted while (i2,k2) is present. Let ins1 be the
time that i1 was inserted and let del1 be the time that
i1 was deleted. Let ins2 be the time that i2 was inserted
and let del2 be the time that i2 was deleted (if
it was deleted). There are two cases.

Case 1: Suppose that (i2,k2) is ultimately deleted.
We know that (i1,k1)>(i2,k2) by assumption. del2>del1
since item i2 is deleted after item i1. ins2<del1 since
item i2 was present when item i1 was deleted.

Consider the situation when item i2 is deleted with
a deletemin operation. The ordered triple for item i1
must appear in deletestack just before the processing
of the i2 deletion operation. This follows because the
triple for item i1 can only be removed from deletestack
by a larger element and yet (i1,k1) refers to the largest
ordered pair deleted while (i2,k2) was present. Now,
since (i1,k1)>(i2,k2) the ordered triple for item i1 will
remain in deletestack even after deletestack is popped
during the processing of the deletemin operation for
item i2. Suppose the top of deletestack is (i3,k3,del3)
after the popping.

It is easy to show that the time-stamps on deletestack
are monotonically ordered with the largest time-stamp
at the top. For this reason we know that
del3≥del1. We noted earlier that del1>ins2, so ins2<del3. But if
ins2<del3 then the algorithm outputs “incorrect” when
it processes the deletemin operation. This contradicts
our assumption that the sequence of operations, arguments
and supposed answers was considered correct
by the algorithm.

Case 2: Suppose the ordered pair (i2,k2) is never
deleted. In the second phase of the algorithm the ordered
triple (i2,k2,ins2) is constructed and is compared
against the ordered triples in deletestack.

The same argument that was used in case 1 above
can be used to show that the test performed in the
second phase of the algorithm would detect a problem
and cause “incorrect” to be output. This contradicts
our assumption that the sequence of operations, arguments
and supposed answers was considered correct by
the algorithm. Since both cases lead to a contradiction
our proof is complete. ∎


Theorem 7.2 The answer validation algorithm for priority
queue has a time complexity of O(n) for processing
a sequence of O(n) operations.

Proof: We first analyse phase one of the algorithm.
Note, there is a constant amount of work done for processing
each single operation if we exclude the cost of
popping the deletestack. Interestingly, popping the
deletestack can take O(n) time for the processing of
a single operation. Luckily, the total amortized complexity
for popping the deletestack while processing a
sequence of O(n) operations is still only O(n). This
is true because each item which is inserted and later
deleted is placed on deletestack and is popped at most
once.

We now consider phase two. The cost of array
scanning and constructing the triples is O(n). The
cost of the bucket sort is O(n) and the cost of the
merge is also O(n). The final test can be implemented
with a simple scan with a complexity of O(n). Hence
the overall complexity is O(n). ∎

We have also solved the answer-validation problem for
abstract data structures that support the following set
of operations: member, insert, delete, deletemin, min,
deletemax, and max. The algorithm used to solve this
problem is intricate but efficient. It requires only O(n)
time to process O(n) operations. A detailed description
of our solution, however, is beyond the scope of
this version of the paper.

8 Conclusions 

The results reported in this paper significantly generalize
the applicability of the certification-trail technique.
In our previously reported work on certification
trails [11], we had to customize each algorithm application,
but we have now developed trails appropriate
to wide classes of algorithms. These certification trails
are based on common data-structure operations such
as those carried out using balanced binary trees and
heaps. Any algorithm using these sets of operations
can therefore employ the certification trail method to
achieve software fault tolerance. To express the full
generality of these ideas, we have provided constructions
of trails for abstract data types such as priority
queues and union-find structures. These trails are applicable
to any data-structure implementation of the
abstract data type. These ideas lead naturally to monitors
for data-structure operations. We are currently
working on an experimental evaluation of the approach
and initial results are promising.

References

[1] Adel'son-Vel'skii, G. M., and Landis, E. M., “An
algorithm for the organization of information”,
Soviet Math. Dokl., pp. 1259-1262, 3, 1962.

[2] Anderson, T., and Lee, P., Fault tolerance: principles
and practices, Prentice-Hall, Englewood
Cliffs, NJ, 1981.

[3] Bayer, R., and McCreight, E., “Organization of
large ordered indexes”, Acta Inform., pp. 173-189,
1, 1972.


[4] Cormen, T. H., and Leiserson, C. E., and Rivest,
R. L., Introduction to Algorithms, McGraw-Hill,
New York, NY, 1990.

[5] Fredman, M. L., and Saks, M. E., “The cell probe
complexity of dynamic data structures,” Proc.
21st ACM Symp. on Theo. Comp., 1989, pp. 109-
122, 2, 1986.

[6] Gabow, H. N., and Tarjan, R. E., “A linear-time
algorithm for a special case of disjoint set union,”
J. of Comp. and Sys. Sci., 30(2), pp. 209-221,
1985.

[7] Guibas, L. J., and Sedgewick, R., “A dichromatic 
framework for balanced trees”, Proceedings of the 
Nineteenth Annual Symposium on Foundations 
of Computing, pp. 8-21, IEEE Computer Society 
Press, 1978. 

[8] Johnson, B., Design and analysis of fault tolerant
digital systems, Addison-Wesley, Reading,
MA, 1989.

[9] Randell, B., “System structure for software fault
tolerance,” IEEE Trans. on Software Engineering,
vol. 1, pp. 220-232, June, 1975.

[10] Siewiorek, D., and Swarz, R., The theory and
practice of reliable system design, Digital Press,
Bedford, MA, 1982.

[11] Sullivan, G.F., and Masson, G.M., “Using certification
trails to achieve software fault tolerance,”
Digest of the 1990 Fault Tolerant Computing
Symposium, pp. 423-431, IEEE Computer
Society Press, 1990.

[12] Sullivan, G.F., and Masson, G.M., “Certification
trails for data structures,” Department of Computer
Science Technical Report JHU 90/11, Johns
Hopkins University, Baltimore, Maryland, 1990.

[13] Tarjan, R. E., “Efficiency of a good but not linear
set union algorithm,” J. ACM, 22(2), pp. 215-225,
1975.

[14] Tarjan, R. E., “A class of algorithms which require
nonlinear time to maintain disjoint sets,” J.
of Comp. and Sys. Sci., 18(2), pp. 110-127, 1979.

[15] Tarjan, R. E., and Leeuwen, J. van, “Worst-case
analysis of set union algorithms,” J. ACM, 31(2),
pp. 245-281, 1984.

[16] Taylor, D., “Error models for robust data structures,”
Dig. 20th Annu. Int. Symp. Fault Tolerant
Comput., pp. 416-422, 1990 June 26-28.

[17] Williams, J. W. J., “Algorithm 232 (heapsort),”
Commun. of ACM, vol. 7, pp. 347-348, 1964.


